Python: Getting Started

Simple Calculations
In addition to all the more complex things you can do in Python, it is also useful for completing simple mathematical operations.

Addition and Subtraction
Add numbers together like this:
1 + 1
2
Notice that you don't need to include = in order to make the operation work.
Subtract numbers like this:
2 - 1
1
Division
Divide numbers like this:
4 / 2
2.0
Remember that Python will return an error if you try to divide by 0!
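To see what that error looks like, here is a small sketch (our own example, not from the lesson) that catches it:

```python
# Dividing by zero raises a ZeroDivisionError
try:
    result = 4 / 0
except ZeroDivisionError as error:
    print(error)  # division by zero
```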
Modulo Division
Perform modulo division like this:
5 % 2
1
Multiplication
Multiply numbers like this:
4 * 2
8
Add exponents to numbers like this:
4**2
16
Order of Operations
Just like everywhere else, Python works on a specific order of operations. That order is:
Parentheses, Exponents, Multiplication, Division, Addition, Subtraction
If you remember being taught PEMDAS, then this is the same thing! Operations in Python can be grouped with levels of parentheses, so if you wanted to add two numbers together, and then add another number to the result, the code would look like this:
(6 + 4) + 4
14
Things get more complicated when we add more terms to the equation, but the principle remains the same: if you open a set of parentheses, make sure you close it too. Here's an example that uses all the PEMDAS possibilities:
(((5**3 + 7) * 4) / 16 + 9 - 2)
40.0
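To double-check how this order applies when parentheses are left out, here is a quick comparison (our own example, not from the lesson):

```python
# Multiplication binds tighter than addition, so 3 * 4 happens first
print(2 + 3 * 4)    # 14

# Parentheses override that order
print((2 + 3) * 4)  # 20
```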
Data Types
There are lots of different types of data in the world, and Python groups that data into several categories.
Boolean (bool):
- Any data which can be expressed as either True or False.
- Used when comparing two values. For example, if you enter 10 > 9, Python will return True.
String (str):
- Data that involves text — either letters, numbers, or special characters.
- Strings are enclosed in either single- or double-quotation marks: "string-1" or 'string-2'.
Numeric (int, float, complex):
- Data that can be expressed numerically.
- An integer, or int, is a whole number, positive or negative, without decimals, of unlimited length: 123.
- A floating-point number, or float, is a number, positive or negative, containing one or more decimals: 123.01.
- A complex number, or complex, is an imaginary number, designated by a j: (3 + 6j).
Sequence (list, tuple, set):
- Data that is a collection of discrete items.
- A list is a collection that is ordered and changeable. It's designated using square brackets [], and items can be of different data types: ["red", 1, 1.03, 1].
- A tuple is a collection which is ordered and unchangeable. It's designated using parentheses (): ("red", 1, 1.03, 1).
- A set is a collection which is unordered and does not permit duplicate items; its items cannot be changed in place, though items can be added or removed. It's designated using curly brackets {}: {"red", 1, 1.03}.
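The difference between changeable and unchangeable collections is easiest to see in code. Here is a short sketch (our own example, not from the lesson):

```python
colors_list = ["red", "green", "blue"]
colors_tuple = ("red", "green", "blue")

# Lists are changeable: assigning to an index works
colors_list[0] = "yellow"
print(colors_list)  # ['yellow', 'green', 'blue']

# Tuples are unchangeable: the same assignment raises a TypeError
try:
    colors_tuple[0] = "yellow"
except TypeError as error:
    print(error)
```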
Mapping (dict):
- Dictionaries store data in key-value pairs. They're designated using curly brackets {}, like a set, but notice that keys and values are associated with each other using a colon :. Each pair is separated from the next using a comma ,.
dict1 = {
    "department": "quindio",
    "property_type": "house",
    "price_usd": 330899.98
}
Binary (bytes, bytearray, memoryview):
- Used to manipulate and display binary data. That is, data that can be expressed with integers represented in base 2.
- Unlike the other data types described above, binary types are not human-readable.
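As a quick illustration (our own example, not from the lesson), a string can be encoded into a bytes object and decoded back:

```python
# Encode a string into a bytes object, then decode it back
raw = "quindio".encode("utf-8")
print(raw)                  # b'quindio'
print(type(raw))            # <class 'bytes'>
print(raw.decode("utf-8"))  # quindio
```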
Lists
In Python, a list is a collection of data that stores multiple items in a single variable. Items in a list are ordered, changeable, and may be duplicated. A list can store data of multiple types; not all the items in the list need to be the same type.
Creating Lists
Lists can be as long or as short as you like. Let's create a short list based on data from the Colombian real estate market to give us something to work with.
Lists are written with square brackets. Code for a short list that shows the price of houses in US dollars looks like this:
price_usd = [97919.38, 300511.20, 293758.14]
print(price_usd)
[97919.38, 300511.2, 293758.14]
Working with Lists
After you've created a list, you can access any item in the list by referring to the item's index number. Keep in mind that in Python, the index of the first item in a list is always 0.
Let's access the second item of our price_usd list.
print(price_usd[1])
300511.2
Practice
Try it yourself! Create and print a list that shows the area of the houses, called area_m2. Include the items 187.0, 82.0, and 235.0.
area_m2 = [187.0, 82.0, 235.0]
area_m2
[187.0, 82.0, 235.0]
If we want to access an item at the end of the list, we can use negative indexing. In negative indexing, -1 refers to the last item, -2 to the second to last, and so on.
Let's access the last item in our price_usd list.
print(price_usd[-1])
293758.14
Practice
Try accessing the second item in your area_m2 list.
area_m2[1]
82.0
Try accessing the last item in the same list.
area_m2[-1]
235.0
Appending Items
It's also possible to add an item to a list that already exists using the append method, like this:
price_usd.append(540244.86)
Practice
Add the item 195.0 to your area_m2 list.
area_m2.append(195.0)
print(area_m2)
[187.0, 82.0, 235.0, 195.0]
Aggregating Items
We can also aggregate the items in a list to make analyzing it more useful. For example, if we wanted to know the total value in US dollars of the houses on our price_usd list, we could use the sum function.
total_usd = sum(price_usd)
We might also be interested in the average value in US dollars of the houses on the same list. To find the average, we divide the sum of the list by its length, using the len function.
average_usd = sum(price_usd) / len(price_usd)
Practice
Try it yourself! Calculate the total area of the houses on your area_m2 list, and find the average area of all the houses on the list.
xxxxxxxxxxprint(total_area_m2)total_area_m2 = sum(area_m2)print(total_area_m2)699.0
xxxxxxxxxxaverage_area_m2 = sum(area_m2)/len(area_m2)average_area_m2 = sum(area_m2)/len(area_m2)print(average_area_m2)174.75
Zipping Items
Finally, it might be useful to combine -- or zip -- two lists together. For example, we might want to create a new list that pairs the values in our price_usd list with our area_m2 list. To do that, we use the zip function. The code looks like this:
new_list = zip(price_usd, area_m2)
zipped_list = list(new_list)
zipped_list
[(97919.38, 187.0), (300511.2, 82.0), (293758.14, 235.0), (540244.86, 195.0)]
You might have noticed that the above code involved putting one object (in this case, new_list) inside the list function. That's because zip returns a generator, and we'll come back to what that is and how it works later in the course.
Practice
Try it yourself! Create a list called area_m2 that includes the terms 235.0, 130.0, and 137.0, then create another list called price_cop that includes the terms 400000000.0, 850000000.0, and 475000000.0. Then zip them together to create a new list called area_price, and print the result.
area_m2 = [235.0, 130.0, 137.0]
price_cop = [400000000.0, 850000000.0, 475000000.0]
new_list2 = zip(area_m2, price_cop)
area_price = list(new_list2)
area_price
[(235.0, 400000000.0), (130.0, 850000000.0), (137.0, 475000000.0)]
Python for Loops
A for loop is used to execute a set of statements for each item in a list.
Working with for Loops
A for loop will run its statements once for each item in a list, no matter how long the list is, but to keep things manageable, let's use our list of real estate values in Colombia.
price_usd = [97919.38, 300511.20, 293758.14, 540244.86]
print(price_usd)
[97919.38, 300511.2, 293758.14, 540244.86]
We might want to see each of the values in the list, so we insert a for loop:
for x in price_usd:
    print(x)
97919.38
300511.2
293758.14
540244.86
Note that the print command is indented.
Practice
Try it yourself using the area_m2 list:
area_m2 = ...

Python Dictionaries
In Python, a dictionary is a collection of data that keeps its items in order, can be changed, and does not allow duplicate keys. Data in a dictionary are always presented as key-value pairs, and the keys cannot be duplicated in the dataset.
Creating Dictionaries
Dictionaries can be as big or as small as you like. Let's create a small dictionary based on data from the Colombian real estate market to give us something to work with.
Dictionaries are written with curly brackets, with key-value pairs inside. Code for a small dictionary looks like this:
colomdict = {
    "property_type": "house",
    "department": "quindio",
    "area": 235.0,
}
colomdict
{'property_type': 'house', 'department': 'quindio', 'area': 235.0}
Practice
Try it yourself! Create and print a dictionary called bogota with the key-value pairs "price_usd": 121555.09, "area_m2": 82.0, and "property_type": "house".
bogota = ...
print(bogota)
Ellipsis
Working with Dictionaries
After you've created a dictionary, you can access any item by using its key name inside square brackets.
Going back to our example dictionary, let's access the value for "department".
x = colomdict["department"]
print(x)
quindio
Practice
Try accessing the value for price_usd in the Bogotá dictionary you created above.
x = ...
You can also use the get method to retrieve a value. That looks like this:
x = colomdict.get("department")
Practice
Now try accessing the value for area_m2 using the get method, and print the result.
x = bogota.get("area_m2")

Dictionary Keys
Sometimes you want to know what the keys are in a dictionary, or you want to iterate through the dictionary. In such cases, you can access all the keys in a dictionary by utilizing the keys method:
colomdict.keys()
dict_keys(['property_type', 'department', 'area'])
If you need to use the keys in a list, you can transform the output:
list(colomdict.keys())
['property_type', 'department', 'area']
You can also iterate through colomdict.keys() without converting it to a list:
for k in colomdict.keys():
    print(k)
property_type
department
area
Practice
Print the value of each key in the colomdict dictionary.
JSON
JSON stands for JavaScript Object Notation, and it's a text format for storing and transporting data.
Working with JSON
JSON works by creating key-value pairs, where the key is data that can be represented by letters (called a string). JSON values can be strings, numbers, objects, arrays, boolean data, or null. JSON usually comes as a list of dictionaries. Here's an example from our colombia-real-estate-1 dataset with two key-value pairs that both include string values:
[
    {"property_type": "house", "department": "bogota"},
    {"property_type": "house", "department": "bogota"},
    {"property_type": "house", "department": "bogota"},
]
[{'property_type': 'house', 'department': 'bogota'},
 {'property_type': 'house', 'department': 'bogota'},
 {'property_type': 'house', 'department': 'bogota'}]
Here's a simplified example with just one dictionary:
{"property_type": "house", "department": "bogota"}
{'property_type': 'house', 'department': 'bogota'}
JSON pairs with numbers look like this:
{"area_m2": 187.0, "price_usd": 330899.98}
{'area_m2': 187.0, 'price_usd': 330899.98}
When you mix more than one type of value, it looks like this:
{"property_type": "house", "price_usd": 330899.98}
{'property_type': 'house', 'price_usd': 330899.98}

References & Further Reading
- A guide to basic math operations in Python
- Python documentation on built-in data types
- Summary of Python data types
- Tutorial on type conversion in Python
- A description of how dictionaries work in Python
- An introduction to JSON
- An introduction to lists in Python
- How to zip lists
- Calculating mean, median, and mode in Python
- A brief tutorial of For Loops
Copyright © 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.
This means:
- ⓧ No downloading this notebook.
- ⓧ No re-sharing of this notebook with friends or colleagues.
- ⓧ No downloading the embedded videos in this notebook.
- ⓧ No re-sharing embedded videos with friends or colleagues.
- ⓧ No adding this notebook to public or private repositories.
- ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.
Python: Advanced

Strings
What's a string?
Recall that a string is any kind of information that can be represented with letters.
Working with strings
When working with data, files and directories often have names that fit a pattern. For example, data on property prices in Colombia and Mexico might be stored in files named:
colombia-real-estate-1.csv
colombia-real-estate-2.csv
colombia-real-estate-3.csv
mexico-city-real-estate-1.csv
mexico-city-real-estate-2.csv
mexico-city-real-estate-3.csv
mexico-city-real-estate-4.csv
mexico-city-real-estate-5.csv
mexico-city-test-features.csv
mexico-city-test-labels.csv
When the list of files is short like this one, it's not difficult to find the ones we want, but if the list were longer, we might need some help. If we're only interested in files that deal with Mexico, we could search for files beginning with mexico-city-real-estate-. To do this, we'll use the glob function from the glob module. The code looks like this:
import glob

glob.glob("./data/mexico-city-real-estate-[0-9].csv")
['./data/mexico-city-real-estate-3.csv', './data/mexico-city-real-estate-1.csv', './data/mexico-city-real-estate-5.csv', './data/mexico-city-real-estate-4.csv', './data/mexico-city-real-estate-2.csv']
The glob.glob function allows for pattern matching. In this example, [0-9] allows for any digit between 0 and 9, but there are lots of other patterns that glob can find. Here are a few of the more common ones:
*       Match any number of characters
?       Match a single character of any kind
[a-z]   Match any lower case alphabetical character in the current locale
[A-Z]   Match any upper case alphabetical character in the current locale
[!a-z]  Do not match any lower case alphabetical character in the current locale
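These wildcard rules can be tried out without touching the file system, using the standard-library fnmatch module, which uses the same matching rules as glob (our own illustration, not part of the lesson):

```python
import fnmatch

# * matches any number of characters
print(fnmatch.fnmatch("mexico-city-real-estate-1.csv", "mexico-city*"))  # True

# [0-9] matches a single digit
print(fnmatch.fnmatch("mexico-city-real-estate-1.csv", "*-[0-9].csv"))   # True

# A non-matching prefix fails
print(fnmatch.fnmatch("colombia-real-estate-1.csv", "mexico-city*"))     # False
```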
So, if we wanted to find all the files from Mexico City, we would use code like this:
glob.glob("./data/mexico-city*")
['./data/mexico-city-real-estate-3.csv', './data/mexico-city-test-labels.csv', './data/mexico-city-real-estate-1.csv', './data/mexico-city-real-estate-5.csv', './data/mexico-city-real-estate-4.csv', './data/mexico-city-real-estate-2.csv', './data/mexico-city-test-features.csv']
Practice
Try it yourself! Find only the data files containing the word test.
So far, you have only searched for files in one specific directory. It's also possible to search for files in subdirectories. To get a listing of all notebook files starting from the directory above this one and all others below it, you can use:
glob.glob("../**/*.ipynb", recursive=True)
['../050-bankruptcy-in-poland/056-data-dictionary.ipynb', '../050-bankruptcy-in-poland/054-gradient-boosting.ipynb', '../050-bankruptcy-in-poland/055-assignment.ipynb', '../050-bankruptcy-in-poland/051-working-with-json.ipynb', '../050-bankruptcy-in-poland/052-imbalanced-data.ipynb', '../050-bankruptcy-in-poland/053-random-forest.ipynb', '../030-air-quality-in-nairobi/035-assignment.ipynb', '../030-air-quality-in-nairobi/034-arma-models-and-hyperparameter-tuning.ipynb', '../030-air-quality-in-nairobi/031-data-wrangling-with-mongodb.ipynb', '../030-air-quality-in-nairobi/032-linear-regression-with-time-series-data.ipynb', '../030-air-quality-in-nairobi/033-autoregressive-models.ipynb', '../060-consumer-finances-in-usa/064-interactive-dash-app.ipynb', '../060-consumer-finances-in-usa/063-clustering-multiple-features.ipynb', '../060-consumer-finances-in-usa/065-assignment.ipynb', '../060-consumer-finances-in-usa/061-exploring-data.ipynb', '../060-consumer-finances-in-usa/066-data-dictionary.ipynb', '../060-consumer-finances-in-usa/062-clustering-two-features.ipynb', '../@textbook/13-ml-data-pre-processing-and-production.ipynb', '../@textbook/03-pandas-getting-started.2022-12-23T07-21-48-609Z.ipynb', '../@textbook/16-ml-unsupervised-learning.ipynb', '../@textbook/05-pandas-summary-statistics.2022-12-23T07-21-48-609Z.ipynb', '../@textbook/08-visualization-plotly.ipynb', '../@textbook/01-python-getting-started.2022-12-23T07-21-48-609Z.ipynb', '../@textbook/09-visualization-seaborn.ipynb', '../@textbook/17-ts-core.ipynb', '../@textbook/21-python-object-oriented-programming.ipynb', '../@textbook/15-ml-regression.ipynb', '../@textbook/10-databases-sql.ipynb', '../@textbook/20-statistics.ipynb', '../@textbook/01-python-getting-started.ipynb', '../@textbook/14-ml-classification.ipynb', '../@textbook/12-ml-core.ipynb', '../@textbook/22-apis.ipynb', '../@textbook/04-pandas-advanced.2022-12-23T07-21-48-609Z.ipynb', 
'../@textbook/05-pandas-summary-statistics.ipynb', '../@textbook/04-pandas-advanced.ipynb', '../@textbook/06-visualization-matplotlib.ipynb', '../@textbook/06-visualization-matplotlib.2022-12-23T07-21-48-609Z.ipynb', '../@textbook/02-python-advanced.ipynb', '../@textbook/19-linux-command-line.ipynb', '../@textbook/07-visualization-pandas.ipynb', '../@textbook/18-ts-models.ipynb', '../@textbook/03-pandas-getting-started.ipynb', '../@textbook/11-databases-mongodb.ipynb', '../070-ds-admissions-in-wqu/071-meet-ds-lab-applicants.ipynb', '../070-ds-admissions-in-wqu/072-etl-class.ipynb', '../070-ds-admissions-in-wqu/073-chi-square-test.ipynb', '../070-ds-admissions-in-wqu/075-assignment.ipynb', '../070-ds-admissions-in-wqu/074-dashboard.ipynb', '../080-volatility-forecasting-in-india/083-garch.ipynb', '../080-volatility-forecasting-in-india/081-working-with-apis.ipynb', '../080-volatility-forecasting-in-india/084-model-deployment.ipynb', '../080-volatility-forecasting-in-india/082-test-driven.ipynb', '../080-volatility-forecasting-in-india/085-assignment.ipynb', '../020-housing-in-buenos-aires/022-price-and-location.ipynb', '../020-housing-in-buenos-aires/021-price-and-size.ipynb', '../020-housing-in-buenos-aires/025-assignment.ipynb', '../020-housing-in-buenos-aires/023-price-and-neighborhood.ipynb', '../020-housing-in-buenos-aires/024-price-and-everything.ipynb', '../040-earthquake-damage-in-nepal/045-assignment.ipynb', '../040-earthquake-damage-in-nepal/046-data-dictionary.ipynb', '../040-earthquake-damage-in-nepal/042-logistic-regression.ipynb', '../040-earthquake-damage-in-nepal/044-demographics.ipynb', '../040-earthquake-damage-in-nepal/043-decision-tree.ipynb', '../040-earthquake-damage-in-nepal/041-sqlite.ipynb', '../010-housing-in-mexico/015-assignment.ipynb', '../010-housing-in-mexico/011-tabular-and-tidy-data.2022-12-23T07-21-48-609Z.ipynb', '../010-housing-in-mexico/012-data-wrangling-with-pandas.ipynb', 
'../010-housing-in-mexico/013-exploratory-data-analysis.2022-12-23T07-21-48-609Z.ipynb', '../010-housing-in-mexico/013-exploratory-data-analysis.ipynb', '../010-housing-in-mexico/012-data-wrangling-with-pandas.2022-12-23T07-21-48-609Z.ipynb', '../010-housing-in-mexico/011-tabular-and-tidy-data.ipynb', '../010-housing-in-mexico/014-size-or-location.ipynb']
Manipulating Strings
We can split a string on a specific character, transforming it into a list:
file_name = "mexico-city-real-estate-1"
file_name.split("-")
['mexico', 'city', 'real', 'estate', '1']
This can be useful when a set of strings follows a similar naming convention and we want to access specific information. For example, we can grab the file number, which is the last element after splitting:
file_name_number = file_name.split("-")[-1]
file_name_number
'1'
Another useful function for manipulating strings is replace. We can use it to replace any element in the string with another item. Note that we put what we want to replace first and what we replace it with second:
file_name = "mexico-city-real-estate-1"
modified_file_name = file_name.replace("-", "_")
modified_file_name
'mexico_city_real_estate_1'
Practice
Change "mexico-city-real-estate-1" to "mexico-city-real-estate" using replace. Hint: To delete a character, we can replace it with an empty string ("").
file_name = "mexico-city-real-estate-1"
modified_file_name = ...
modified_file_name
Ellipsis
Working with f-strings
We usually use print to examine output in Python, but most of the examples we've been printing have been relatively short. Formatted strings are helpful for all sorts of reasons, but when we're assembling and formatting a long string, using the print function alone can be difficult and time-consuming. Along the same lines, it's also useful to directly evaluate variables and expressions within strings. To do those things, we create f"" strings. The code looks like this:
Home = "Mexico City"
f"My home is {Home}"
'My home is Mexico City'
import datetime

python_birthday = datetime.datetime(year=1991, month=2, day=20)
print(
    f"Python first appeared on {python_birthday:%B %d} in the year {python_birthday:%Y}."
)
now = datetime.datetime.now()
print(f"Python is {now.year - python_birthday.year} years old.")
Python first appeared on February 20 in the year 1991.
Python is 32 years old.
Practice
Mexico-Tenochtitlan was established on 13 March 1325; use f-strings to indicate how long ago that was.
mexico_founding = datetime.datetime(year=1325, month=3, day=13)
now = datetime.datetime.now()
f"Mexico-Tenochtitlan was established {(now - mexico_founding).days} days ago."
'Mexico-Tenochtitlan was established 254915 days ago.'
Iterators and Iterables
A list is a container with a countable number of values, which makes it an iterable: something we can step through one item at a time. An iterator is the object that does the stepping, and it retrieves values only when we ask for them. If we try to bring in a large database — over a million values, for example — applying every action to every value at once would take up a huge amount of memory. Iterators are helpful because they allow us to free up memory to use for other tasks. We'll spend more time working with databases later on, but for now, let's take a look at some code:
from pymongo import MongoClient

client = MongoClient(host="localhost", port=27017)
list(client.list_databases())
Setting aside the first two lines of code, we have a method that has returned a list of four databases. If we want to examine each database by itself, we can create a variable called results, and then try to print it.
results = client.list_databases()
print(results)
That doesn't seem like much of anything, but if we call next() on it, we'll get back something more useful.
print(next(results))
That makes much more sense! As you can see, this returns the first row. If we do it again, we'll get the second row:
print(next(results))
We can keep doing this until we get to the end of the list, at which point we'll get an error telling us that there's nothing left to print. Every time we call next(), we're using results as an iterator to step through our iterable.
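Since the database connection above may not be available on every machine, the same idea can be sketched with a plain list (our own self-contained example):

```python
price_usd = [97919.38, 300511.2, 293758.14]

# iter() turns the list (an iterable) into an iterator
results = iter(price_usd)

# Each call to next() retrieves exactly one value
print(next(results))  # 97919.38
print(next(results))  # 300511.2
print(next(results))  # 293758.14

# One more call raises StopIteration: the iterator is exhausted
try:
    next(results)
except StopIteration:
    print("Nothing left to retrieve")
```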
List Comprehension
List comprehension is used to iterate through lists without explicitly writing loops, which is especially useful for filtering data according to a specific condition.
Let's take a look at a list that shows property prices in Mexican pesos.
price_mexican_pesos = [
    35000000.0,
    2000000.0,
    2700000.0,
    6347000.0,
    6994543.16,
    6617835.61,
    670000.0,
]
But maybe we're interested in comparing these prices to property values in Colombia. To do that, we'll need to figure out how to express the data on our list in Colombian pesos. We can use a for loop to make the conversion based on an exchange rate of 1 Mexican peso to 190 Colombian pesos. The code looks like this:
price_colombian_pesos = []
for price in price_mexican_pesos:
    price_colombian_pesos.append(price * 190)
print(price_colombian_pesos)
But what if we could do the same thing using fewer lines? That's what list comprehension is for. The code looks like this:
price_colombian_pesos = [price * 190 for price in price_mexican_pesos]
print(price_colombian_pesos)
We can use list comprehension to find all the house entries in this list of properties, like this:
records = [ 'sell,apartment,|México|Distrito Federal|Benito Juárez|,"19.384467,-99.135872",1860000.0,MXN,1843173.75,97996.85,,70.0,,26571.42857142857', 'sell,apartment,|México|Distrito Federal|Iztapalapa|Cerro de La Estrella|,"19.324123,-99.074132",700000.0,MXN,693667.44,36880.53,,50.0,,14000.0', 'sell,house,|México|Distrito Federal|La Magdalena Contreras|San Jerónimo Lídice|,"19.317653,-99.236291",3350000.0,MXN,3319694.98,176499.72,,350.0,,9571.42857142857', 'sell,apartment,|México|Distrito Federal|Cuauhtémoc|,"19.446313,-99.14006",405108.0,MXN,401443.16,21343.71,,50.0,,8102.16', 'sell,house,|México|Distrito Federal|Coyoacán|,"19.303906,-99.107812",7200000.0,MXN,7134866.79,379342.68,,250.0,,28800.0', 'sell,apartment,|México|Distrito Federal|Benito Juárez|,"19.374171,-99.181264",2425000.0,MXN,2403062.73,127764.72,,96.0,,25260.416666666668', 'sell,apartment,|México|Distrito Federal|Tlalpan|,"19.287428,-99.122283",1250000.0,MXN,1238692.07,65858.1,,65.0,,19230.76923076923', 'sell,house,|México|Distrito Federal|Venustiano Carranza|,"19.436436,-99.117256",1362000.0,MXN,1349678.96,71758.99,,98.0,,13897.959183673467', 'sell,apartment,|México|Distrito Federal|Benito Juárez|,"19.382429,-99.160199",2250000.0,MXN,2229645.73,118544.58,,90.0,,25000.0', 'sell,house,|México|Distrito Federal|Tlalpan|Granjas Coapa|,"19.300456,-99.115741",3900000.0,MXN,3864719.42,205477.28,,153.0,,25490.19607843137', 'sell,apartment,|México|Distrito Federal|Álvaro Obregón|,"19.363167,-99.276028",9000000.0,MXN,8918583.49,474178.35,,188.0,,47872.34042553192', 'sell,house,|México|Distrito Federal|Coyoacán|Villa Coyoacán|,"19.348694,-99.16291",1150000.0,USD,21629775.0,1150000.0,,555.0,,2072.072072072072', 'sell,house,|México|Distrito Federal|Tlalpan|,"19.300963,-99.144237",7500000.0,MXN,7432152.81,395148.62,,385.0,,19480.51948051948', 'sell,house,|México|Distrito Federal|Coyoacán|Paseos de Taxqueña|,"19.343979,-99.124863",6310000.0,MXN,6252917.98,332451.71,,183.0,,34480.87431693989', 
'sell,apartment,|México|Distrito Federal|Coyoacán|San Diego Churubusco|,"19.354509,-99.149765",10000000.0,MXN,9909537.15,526864.83,,293.0,,34129.69283276451',
]

[row for row in records if "house" in row]

Practice
Explore the list records in the list, and find all entries located in Tlalpan
Functions
When we code in Python, we want to create readable programs. One of the easiest ways to make a program readable is by not repeating sections of code that do the same thing. We do that by using functions. For example, you might have the surface area of a property in square meters, but you want to see it in square feet. Keeping in mind that one square meter = 10.76391 square feet, you can write a function that starts with the area in square meters, and gives as output the area in square feet. The code looks like this:
def m2toft2(area_meter2):
    area_feet2 = 10.76391 * area_meter2
    return area_feet2
The code above defines a function called m2toft2 that takes in a single input, called area_meter2, and returns a single output, called area_feet2. Let's try it out:
m2toft2(4)
A function by itself can be difficult to understand, so let's add some comments describing what the function does.
def m2toft2(area_meter2):
    """
    This function takes in as input the area in square meters
    and returns as an output the area in square feet

    input: area_meter2, the area in square meters
    output: area_feet2, the area in square feet
    """
    area_feet2 = 10.76391 * area_meter2
    return area_feet2
This way, if you forget what m2toft2 does, Python will be able to remind you, like this:
help(m2toft2)xxxxxxxxxxThis can be especially useful for large programs with lots of functions, some of which might have multiple arguments. For example, a function might take a list of areas of properties in square meters and a list of prices per square meter, and return lists with area in square feet and price per square foot. That function would look like this:
```python
def convert_area(area_meters2):
    """
    This function takes in a list of areas in square meters
    and returns a list of areas in square feet

    input: area_meters2, list of areas in square meters
    output: area_feet2, list of areas in square feet
    """
    area_feet2 = [item * 10.76391 for item in area_meters2]
    return area_feet2
```

Let's try it and see what happens:
```python
surface_total_in_m2 = [1860000.0, 700000.0, 3350000.0]
surface_total_in_ft2 = convert_area(surface_total_in_m2)
print(surface_total_in_ft2)
```

Practice
Python comes with many predefined functions. Try this one:
```python
help(max)
```

Now write a function that returns the greatest per-unit-area price from a list of property prices per unit area, and then use your function on the list price_usd_per_m2.
```python
price_usd_per_m2 = [97996.85, 36880.53, 176499.72]


def find_max_price_per_area(price_per_meter2):
    """Find the most expensive price per unit area in a list

    Parameters
    ----------
    price_per_meter2 : list of float
        List with the price per unit area of each property

    Returns
    -------
    The price per unit area of the most expensive property
    """
    return ...


find_max_price_per_area(price_usd_per_m2)
```

Practice
The previous exercise can be solved with the max function that's built into Python. Keeping in mind that list comprehensions can be used to iterate through lists, use a list comprehension or a loop to write a function which, given a list of property areas and a corresponding list of property prices per unit area, returns the total price of the most expensive property.
```python
def find_max_price(area_meter2, price_per_meter2):
    """
    Given two lists, the first with the areas of properties and the
    second with prices per unit area, return the price of the most
    expensive property using a list comprehension

    input: area_meter2, list with the total area of each property
    input: price_per_meter2, list with the price per unit area of each property
    output: the price of the most expensive property
    """
    return ...


find_max_price(surface_total_in_m2, price_usd_per_m2)
```

## Lambda Functions
The function definitions we've been working with so far are fine for most purposes, but they can easily become a little long. When that happens, you might want a shorter way of expressing a function; that's what lambda functions are for. Here's code for a function which adds 3 to a number.
```python
add_three = lambda a: a + 3  # noqa: E731
```

Now that we've defined our function, let's try it out. If we wanted the function to add 3 to 5, the code would look like this:
```python
add_three(5)
```

Practice
Try it yourself! Write a lambda function called sub_4 which will subtract 4 from a given number, and then try it out with the number 7.
```python
sub_4 = lambda a: a - 4  # noqa: E731
```

## Working with Errors
### Error Handling
Error handling is a very important part of coding: it makes sure our code runs smoothly even with edge cases. try and except are the syntax we use for error handling. Let's create a function to demonstrate how this works.
We start with a function that calculates the quotient of two numbers. The function takes two inputs, nominator and denominator, and works only when:
- both inputs are numbers
- the denominator is not zero
We can use try and except to make sure the function runs smoothly even with invalid inputs.
```python
def get_quotient(nominator, denominator):
    try:
        quotient = nominator / denominator
        return quotient
    except:  # noqa: E722
        return print("function not working")
```

The function will return the quotient as long as there are no errors.
```python
get_quotient(1, 2)
```

The function will go to the except section and print a message when the inputs are wrong:
```python
get_quotient(1, 0)
```

### Raising Errors
Even though the previous function can handle errors, it doesn't tell us which error is causing the issue. In this case, we can print out the error message:
```python
def get_quotient(nominator, denominator):
    try:
        quotient = nominator / denominator
        return quotient
    except Exception as e:
        return print(e)
```

Now we can see the error message when the denominator is zero:
```python
get_quotient(1, 0)
```

And the error message when the inputs are not numerical:
get_quotient(1, "0")xxxxxxxxxxPractice
Try it yourself! Rebuild the find_max_price_per_area function by adding error handling here, and make sure to raise an error when you encounter one.
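As a hint, the raise keyword lets us signal an error ourselves instead of letting Python fail later. Here's a minimal sketch on the earlier quotient example (the specific checks and messages are one possible choice, not the only one):

```python
def get_quotient_strict(nominator, denominator):
    # Raise descriptive errors instead of returning a vague message
    if not isinstance(nominator, (int, float)) or not isinstance(denominator, (int, float)):
        raise TypeError("both inputs must be numbers")
    if denominator == 0:
        raise ZeroDivisionError("denominator cannot be zero")
    return nominator / denominator
```

Calling get_quotient_strict(1, "0") now fails immediately with a clear TypeError; the same pattern works for find_max_price_per_area.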
```python
def find_max_price_per_area(price_per_meter2):
    """Find the most expensive price per unit area in a list

    Parameters
    ----------
    price_per_meter2 : list of float
        List with the price per unit area of each property

    Returns
    -------
    The price per unit area of the most expensive property
    """


price_usd_per_m2 = [97996.85, 36880.53, 176499.72]
find_max_price_per_area(price_usd_per_m2)

price_usd_per_m2 = ["97996.85", 36880.53, 176499.72]
find_max_price_per_area(price_usd_per_m2)
```

## Files
### Create Files Using a Context Manager
A context manager allows you to allocate and release resources precisely when you want to. The most widely used example of a context manager is the with statement. Suppose you have two related operations which you would like to execute as a pair, with a block of code in between. Context managers allow you to do specifically that. For example:
with open("data/example.txt", "w") as f: f.write("Hello")xxxxxxxxxxThe code above will create a file called example.txt inside the data directory, with only one line: "Hello". We can add multiple lines to the file by adding the /n to separate the line.
with open("data/example.txt", "w") as f: f.write("Hello") f.write("\n") f.write("Hola")xxxxxxxxxxPractice
Create a txt file named practice.txt inside the data directory with three lines, using a context manager.
### Saving and Loading Files with joblib
We can also use joblib's dump and load functions to save and load data. Besides data, we can also save and load trained models for later use. Let's look at an example. First, we import joblib and train a model on the iris dataset:
```python
from joblib import dump, load
from sklearn import datasets, svm

iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = svm.SVC()
clf.fit(X, y)

# Save the model to a path
dump(clf, "data/trained_model.pkl")

# Load the model from a path and make predictions again
model = load("data/trained_model.pkl", mmap_mode="r")
model.predict(X)
```

### Working with Filepaths
A filepath is the location of a specific file. Python uses the os module to work with path names and access files in the local directory. Here are some common use cases of the os module:
os.getcwd() returns the current working directory:
```python
import os

print(f"Current working directory is at {os.getcwd()}")
```

Let's see the following functions with an example. The path name below points to a file called myfile.txt:
pathname = "/home/jovyan/work/ds_curriculum/myfile.txt"xxxxxxxxxxFirst, os.path.abspath() returns a path name to the path passed as the parameter to this function.
```python
os.path.abspath(pathname)
```

If you only give a file name as the parameter, it will return a path made of the current working directory and the file name:
filename = "myfile.txt"os.path.abspath(filename)xxxxxxxxxxos.path.dirname() will leave the file name part, and only show the directory of a path name.
```python
os.path.dirname(pathname)
```

On the other hand, os.path.basename() only shows the file name part of a path name:
```python
os.path.basename(pathname)
```

The os.path.split() function splits a full path name and returns a tuple containing the path and the filename. The first part of the tuple is the directory containing the file; the second part is the file itself:
```python
os.path.split(pathname)
```

The os.path.join() function constructs a path name out of one or more partial path names, inserting a slash between the parts as needed:
```python
print(os.path.join("/new_directory", "myfile.txt"))
```

## Testing Code
Python has a very useful function, isinstance, that can be used to check the type of an object. The following example checks whether the object is an integer.
xxxxxxxxxxisinstance(1, int)True
You can combine isinstance with the any and all functions to apply a type check across multiple objects. For example, the following expression checks whether any element of the inputs list is a string:
```python
inputs = [1, 2]
any(isinstance(item, str) for item in inputs)
```

```
False
```
Since they are all integers, the expression returns False. The following checks whether all elements in the inputs list are either integers or floats:
```python
all(isinstance(item, (int, float)) for item in inputs)
```

```
True
```
You can add an assert statement to your code to make sure you are using the right input data before you run the code:
```python
assert isinstance(1, int)
```

If the statement after assert is True, the code will continue and nothing will be produced. If it is False, an AssertionError will be thrown. You can use the error to debug your code.
We can modify the get_quotient function by adding an isinstance check:
```python
def get_quotient(nominator, denominator):
    inputs = [nominator, denominator]
    if all(isinstance(item, (int, float)) for item in inputs):
        quotient = nominator / denominator
        return quotient
    else:
        return print("inputs must be numbers")
```

Another useful function is hasattr, which checks whether an object has a given attribute. The object can be an instance of a built-in class or a user-defined class. In the following example, we check whether the str class has the isupper and isstring attributes using hasattr. The function returns a boolean, either True or False:
```python
print("string has isupper method: ", hasattr(str, "isupper"))
print("string has isstring method: ", hasattr(str, "isstring"))
```

```
string has isupper method:  True
string has isstring method:  False
```
```python
class Cat:
    age = 3
    name = "Lily"


cat = Cat()
hasattr(cat, "age")
```

```
True
```
```python
hasattr(cat, "gender")
```

```
False
```
## References and Further Reading
Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
# Pandas: Getting Started
## Pandas
Pandas is a Python library used for working with datasets. It does that by helping us make sense of DataFrames, which are a form of two-dimensional structured data, like a table with columns and rows. But before we can do anything else, we need to start with data in a CSV file.
## Importing Data
### CSV Files
CSV stands for Comma Separated Values, and it's a file type that stores data as a table. Data presented in a table is called structured data, because there is a meaningful relationship between the columns and rows. A CSV might also hold panel data: repeated observations of the same subjects at different points in time. The datasets we're using in this part of the course are all structured tables, but you'll see other arrangements of data as you move through your projects.
If you're familiar with the way data tables look in spreadsheet applications like Excel, you might be surprised to see that raw CSV files don't look like that. If you came across a CSV file and opened it to see what it looked like, you'd see something like this:
```
property_type,department,lat,lon,area_m2,price_usd
house,Bogotá D.C,4.69,-74.048,187.0,"$330,899.98"
house,Bogotá D.C,4.695,-74.082,82.0,"$121,555.09"
house,Quindío,4.535,-75.676,235.0,"$219,474.47"
house,Bogotá D.C,4.62,-74.129,195.0,"$97,919.38"
```
### Dictionaries
You can create a DataFrame from a Python dictionary using the from_dict function.
```python
import pandas as pd

data = {"col_1": [3, 2, 1, 0], "col_2": ["a", "b", "c", "d"]}
pd.DataFrame.from_dict(data)
```

|   | col_1 | col_2 |
|---|---|---|
| 0 | 3 | a |
| 1 | 2 | b |
| 2 | 1 | c |
| 3 | 0 | d |
By default, the DataFrame will be created using the keys as columns. Note that the lists of values must all have the same length for the code to work. We can also use the keys as the index instead of as columns:
```python
pd.DataFrame.from_dict(data, orient="index")
```

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| col_1 | 3 | 2 | 1 | 0 |
| col_2 | a | b | c | d |
We can also specify column names:

```python
pd.DataFrame.from_dict(data, orient="index", columns=["A", "B", "C", "D"])
```

|   | A | B | C | D |
|---|---|---|---|---|
| col_1 | 3 | 2 | 1 | 0 |
| col_2 | a | b | c | d |
Practice
Try it yourself! Create a DataFrame from the dictionary clothes, making the keys the index and naming the columns ['color', 'size'].
```python
clothes = {"shirt": ["red", "M"], "sweater": ["yellow", "L"], "jacket": ["black", "L"]}
pd.DataFrame.from_dict(clothes, orient="index", columns=["color", "size"])
```

|   | color | size |
|---|---|---|
| shirt | red | M |
| sweater | yellow | L |
| jacket | black | L |
### JSON Files
JSON is short for JavaScript Object Notation. It is another widely used data format for storing and transferring data. It is lightweight and very human-readable. In Python, we can use the json library to read JSON files. Here is an example of a JSON string.
```python
info = """{
  "firstName": "Jane",
  "lastName": "Doe",
  "hobby": "running",
  "age": 35
}"""
print(info)
```

```
{
  "firstName": "Jane",
  "lastName": "Doe",
  "hobby": "running",
  "age": 35
}
```
Use the json library to load the JSON string into a Python dictionary:
```python
import json

data = json.loads(info)
data
```

```
{'firstName': 'Jane', 'lastName': 'Doe', 'hobby': 'running', 'age': 35}
```

We can load a JSON string or file into a dictionary because they are organized in the same way: key-value pairs.
```python
data["firstName"]
```

```
'Jane'
```
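Going the other direction, json.dumps serializes a dictionary back into a JSON string, which is handy for saving or sending data. A quick sketch:

```python
import json

data = {"firstName": "Jane", "lastName": "Doe", "hobby": "running", "age": 35}

# dumps turns the dictionary into a JSON string; indent makes it readable
info = json.dumps(data, indent=2)
print(info)

# Round trip: loading the string gives back an equal dictionary
assert json.loads(info) == data
```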
A dictionary may not be as convenient as a DataFrame for data manipulation and cleaning. But once we've turned our JSON string into a dictionary, we can transform it into a DataFrame using the from_dict method.
```python
df = pd.DataFrame.from_dict(data, orient="index", columns=["subject 1"])
df
```

|   | subject 1 |
|---|---|
| firstName | Jane |
| lastName | Doe |
| hobby | running |
| age | 35 |
Practice
Try it yourself! Load the JSON string clothes, then transform it into a DataFrame with properly named columns.
```python
clothes = """{"shirt": ["red","M"], "sweater": ["yellow","L"]}"""
data = json.loads(clothes)
df = pd.DataFrame.from_dict(data, orient="index", columns=["color", "size"])
df
```

|   | color | size |
|---|---|---|
| shirt | red | M |
| sweater | yellow | L |
### Load Compressed Files in Python
In the big data era, it is very likely that we'll need to read data from compressed files. One way to unzip the data is to use gzip. We can load the poland-bankruptcy-data-2008.json.gz file from the data folder using the following code:
```python
import gzip
import json

with gzip.open("data/poland-bankruptcy-data-2008.json.gz", "r") as f:
    poland_data_gz = json.load(f)
```

poland_data_gz is a dictionary, and we only need the data portion of it.
```python
poland_data_gz.keys()
```

```
dict_keys(['schema', 'data', 'metadata'])
```
We can use the from_dict function from pandas to read the data:
```python
import pandas as pd

df = pd.DataFrame.from_dict(poland_data_gz["data"])
df.head()
```

|   | company_id | feat_1 | feat_2 | feat_3 | feat_4 | feat_5 | feat_6 | feat_7 | feat_8 | feat_9 | ... | feat_56 | feat_57 | feat_58 | feat_59 | feat_60 | feat_61 | feat_62 | feat_63 | feat_64 | bankrupt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.202350 | 0.46500 | 0.240380 | 1.5171 | -14.547 | 0.510690 | 0.25366 | 0.91816 | 1.15190 | ... | 0.13184 | 0.473950 | 0.86816 | 0.00024 | 8.5487 | 5.16550 | 107.740 | 3.38790 | 5.3440 | False |
| 1 | 2 | 0.030073 | 0.59563 | 0.186680 | 1.3382 | -37.859 | -0.000319 | 0.04167 | 0.67890 | 0.32356 | ... | 0.12146 | 0.074369 | 0.87235 | 0.00000 | 1.5264 | 0.63305 | 622.660 | 0.58619 | 1.2381 | False |
| 2 | 3 | 0.257860 | 0.29949 | 0.665190 | 3.2211 | 71.799 | 0.000000 | 0.31877 | 2.33200 | 1.67620 | ... | 0.16499 | 0.369210 | 0.81614 | 0.00000 | 4.3325 | 3.19850 | 65.215 | 5.59690 | 47.4660 | False |
| 3 | 4 | 0.227160 | 0.67850 | 0.042784 | 1.0828 | -88.212 | 0.000000 | 0.28505 | 0.47384 | 1.32410 | ... | 0.29358 | 0.706570 | 0.78617 | 0.48456 | 5.2309 | 5.06750 | 142.460 | 2.56210 | 3.0066 | False |
| 4 | 5 | 0.085443 | 0.38039 | 0.359230 | 1.9444 | 21.731 | 0.187900 | 0.10823 | 1.37140 | 1.11260 | ... | 0.10124 | 0.163790 | 0.89876 | 0.00000 | 5.7035 | 4.00200 | 89.058 | 4.09840 | 5.9874 | False |
5 rows × 66 columns
Practice
Read poland-bankruptcy-data-2007.json.gz into a DataFrame.
```python
# Load the file into a dictionary
import gzip
import json

import pandas as pd

with gzip.open("data/poland-bankruptcy-data-2007.json.gz", "r") as f:
    polandict = json.load(f)
polandict.keys()

# Transform the dictionary into a DataFrame
df = pd.DataFrame.from_dict(polandict["data"])
df.head()
```

|   | company_id | feat_1 | feat_2 | feat_3 | feat_4 | feat_5 | feat_6 | feat_7 | feat_8 | feat_9 | ... | feat_56 | feat_57 | feat_58 | feat_59 | feat_60 | feat_61 | feat_62 | feat_63 | feat_64 | bankrupt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.200550 | 0.37951 | 0.39641 | 2.0472 | 32.3510 | 0.38825 | 0.249760 | 1.33050 | 1.1389 | ... | 0.121960 | 0.39718 | 0.87804 | 0.001924 | 8.4160 | 5.1372 | 82.658 | 4.4158 | 7.4277 | False |
| 1 | 2 | 0.209120 | 0.49988 | 0.47225 | 1.9447 | 14.7860 | 0.00000 | 0.258340 | 0.99601 | 1.6996 | ... | 0.121300 | 0.42002 | 0.85300 | 0.000000 | 4.1486 | 3.2732 | 107.350 | 3.4000 | 60.9870 | False |
| 2 | 3 | 0.248660 | 0.69592 | 0.26713 | 1.5548 | -1.1523 | 0.00000 | 0.309060 | 0.43695 | 1.3090 | ... | 0.241140 | 0.81774 | 0.76599 | 0.694840 | 4.9909 | 3.9510 | 134.270 | 2.7185 | 5.2078 | False |
| 3 | 4 | 0.081483 | 0.30734 | 0.45879 | 2.4928 | 51.9520 | 0.14988 | 0.092704 | 1.86610 | 1.0571 | ... | 0.054015 | 0.14207 | 0.94598 | 0.000000 | 4.5746 | 3.6147 | 86.435 | 4.2228 | 5.5497 | False |
| 4 | 5 | 0.187320 | 0.61323 | 0.22960 | 1.4063 | -7.3128 | 0.18732 | 0.187320 | 0.63070 | 1.1559 | ... | 0.134850 | 0.48431 | 0.86515 | 0.124440 | 6.3985 | 4.3158 | 127.210 | 2.8692 | 7.8980 | False |
5 rows × 66 columns
### Pickle Files
Pickle in Python is primarily used for serializing and deserializing a Python object structure. Serialization is the process of turning an object in memory into a stream of bytes so you can store it on disk or send it over a network. Deserialization is the reverse process: turning a stream of bytes back into an object in memory.
According to the pickle module documentation, the following types can be pickled:
- None
- Booleans
- Integers, long integers, floating point numbers, complex numbers
- Normal and Unicode strings
- Tuples, lists, sets, and dictionaries containing only objects that can be pickled
- Functions defined at the top level of a module
- Built-in functions defined at the top level of a module
- Classes that are defined at the top level of a module
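Note the emphasis on "defined at the top level of a module": objects without an importable name, such as lambdas, can't be pickled. A quick sketch of what happens if you try (we catch the error here just to show it):

```python
import pickle

square = lambda x: x * x  # noqa: E731

try:
    pickle.dumps(square)
except Exception as error:
    # pickle stores functions by name, and a lambda has no usable name
    print(type(error).__name__)
```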
Let's demonstrate using a Python dictionary as an example.
```python
clothes = {"shirt": ["red", "M"], "sweater": ["yellow", "L"], "jacket": ["black", "L"]}
clothes
```

```
{'shirt': ['red', 'M'], 'sweater': ['yellow', 'L'], 'jacket': ['black', 'L']}
```

```python
import pickle

pickle.dump(clothes, open("./data/clothes.pkl", "wb"))
```

Now in the data folder there will be a file named clothes.pkl. We can read the pickled file using the following code:
```python
with open("./data/clothes.pkl", "rb") as f:
    unpickled = pickle.load(f)
unpickled
```

```
{'shirt': ['red', 'M'], 'sweater': ['yellow', 'L'], 'jacket': ['black', 'L']}
```

Note that we opened the file with "wb" when creating it and with "rb" when reading it back: "w" and "r" select write and read mode, and "b" is needed because pickle files are binary.
Practice
Store sample_list in a pickle file, then load the pickle file back into a list.
```python
sample_list = [1, 2, 3, 4, 5]

import pickle

pickle.dump(sample_list, open("./data/numlist.pkl", "wb"))

with open("./data/numlist.pkl", "rb") as f:
    unpickled = pickle.load(f)
unpickled
```

```
[1, 2, 3, 4, 5]
```
## Working with DataFrames
The first thing we need to do is import pandas; we'll use pd as an alias when we include it in our code.
Pandas is just a library; to get anything done, we need a dataset too. We'll use the read_csv method to create a DataFrame from a CSV file.
```python
import pandas as pd

df = pd.read_csv("data/colombia-real-estate-1.csv")
df.head()
```

|   | property_type | department | lat | lon | area_m2 | price_usd |
|---|---|---|---|---|---|---|
| 0 | house | Bogotá D.C | 4.690 | -74.048 | 187.0 | $330,899.98 |
| 1 | house | Bogotá D.C | 4.695 | -74.082 | 82.0 | $121,555.09 |
| 2 | house | Quindío | 4.535 | -75.676 | 235.0 | $219,474.47 |
| 3 | house | Bogotá D.C | 4.620 | -74.129 | 195.0 | $97,919.38 |
| 4 | house | Atlántico | 11.012 | -74.834 | 112.0 | $115,477.34 |
Practice
Try it yourself! Create a DataFrame called df2 using the colombia-real-estate-2 CSV file.
```python
df2 = ...
df2.head()
```
## Working with DataFrame Indices
A DataFrame stores data in a row-and-column format. The DataFrame Index is a special kind of column that identifies the location of each row. The default index uses integers starting at zero, but you can also set up a customized index such as "name" or "location". For example, in the following real estate dataset, the default index is the integer count.
```python
import pandas as pd

df = pd.read_csv("./data/colombia-real-estate-2.csv")
df.head()
```

|   | property_type | department | lat | lon | area_m2 | price_cop |
|---|---|---|---|---|---|---|
| 0 | house | Magdalena | 11.233 | -74.204 | 235.0 | 4.000000e+08 |
| 1 | house | Bogotá D.C | 4.714 | -74.030 | 130.0 | 8.500000e+08 |
| 2 | house | Cundinamarca | 4.851 | -74.059 | 137.0 | 4.750000e+08 |
| 3 | house | Atlántico | 11.006 | -74.808 | 346.0 | 1.400000e+09 |
| 4 | house | Cundinamarca | 4.857 | -74.061 | 175.0 | 4.300000e+08 |
We can access the index through the .index attribute:
```python
df.index[:5]
```

```
RangeIndex(start=0, stop=5, step=1)
```
Using the set_index method, we can make the department column the index instead. Note that pandas does allow duplicate values in an index, but an index is most useful when its values help distinguish rows, so a column with only a few repeated values, like property_type here, would make a poor index.
```python
df.set_index("department", inplace=True)
df.head()
```

Now you can see the index column has changed:
```python
df.index[:5]
```

Using the reset_index() function, we can reset the index back to the default integer count, and department will become a column again.
```python
df.reset_index(inplace=True)
df.head()
```

Practice
Try it yourself! Set letter as the index, then call the index. Then reset the index.
```python
data = {
    "letter": ["a", "b", "c", "d"],
    "number": [3, 2, 1, 0],
    "location": ["east", "east", "east", "west"],
}
df = pd.DataFrame.from_dict(data)

# set "letter" as the index
df.index[:3]
df.set_index("letter", inplace=True)
df.index[:3]

# reset the index
df.reset_index(inplace=True)
df
```

## Inspecting DataFrames
Once we've created a DataFrame, we need to inspect it in order to see what's there. Pandas has many ways to inspect a DataFrame, but we're only going to look at three of them: shape, info, and head.
If we're interested in understanding the dimensionality of the DataFrame, we can use the shape attribute. The code looks like this:
```python
df.shape
```

The shape output tells us that the colombia-real-estate-1 DataFrame -- which we called df -- has 3066 rows and 6 columns.
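Since shape is just a (rows, columns) tuple, you can unpack it directly. A sketch on a small made-up DataFrame:

```python
import pandas as pd

df_small = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# shape is a (rows, columns) tuple, so it unpacks like any tuple
n_rows, n_cols = df_small.shape
print(n_rows, n_cols)  # 3 2
```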
If we're trying to get a general idea of what the DataFrame contains, we can use the info method. The code looks like this:
```python
df.info()
```

The info output tells us all sorts of things about the DataFrame: the number of columns, the names of the columns, the data type of each column, and how many non-null rows the DataFrame contains.
Practice
Try it yourself! Use info and shape to explore df2, which you created above.
If we wanted to see the rows in our new DataFrame, we could use the print function. That's not much of a problem with this particular dataset, but once you start working with much bigger datasets, printing the whole thing will cause all sorts of problems.
So instead of doing that, we'll just take a look at the first five rows by using the head method. The code looks like this:
```python
df.head()
```

By default, head returns the first five rows of data, but you can specify as many rows as you like. Here's what the code looks like for just the first two rows:
```python
print(df.head(2))
```

Practice
Try it yourself! Use the head method to return the first five and the first seven rows of the colombia-real-estate-2 dataset.
## Sorting
Even though a DataFrame in many ways behaves like a dict, it is also ordered, which means we can sort the data in it. Pandas provides two sorting methods: sort_values and sort_index.
```python
import pandas as pd

df = pd.read_csv("data/colombia-real-estate-1.csv")
df.head()
```

|   | property_type | department | lat | lon | area_m2 | price_usd |
|---|---|---|---|---|---|---|
| 0 | house | Bogotá D.C | 4.690 | -74.048 | 187.0 | $330,899.98 |
| 1 | house | Bogotá D.C | 4.695 | -74.082 | 82.0 | $121,555.09 |
| 2 | house | Quindío | 4.535 | -75.676 | 235.0 | $219,474.47 |
| 3 | house | Bogotá D.C | 4.620 | -74.129 | 195.0 | $97,919.38 |
| 4 | house | Atlántico | 11.012 | -74.834 | 112.0 | $115,477.34 |
We can sort the whole DataFrame by the values of a column.
```python
df.sort_values("area_m2").head()
```

|   | property_type | department | lat | lon | area_m2 | price_usd |
|---|---|---|---|---|---|---|
| 958 | house | Cundinamarca | 4.936000 | -74.022000 | 64.0 | $77,660.19 |
| 1466 | house | Atlántico | 10.999000 | -74.803000 | 64.0 | $104,672.44 |
| 2883 | house | Bogotá D.C | 4.618000 | -74.069000 | 64.0 | $110,750.19 |
| 1947 | apartment | Bogotá D.C | 4.695374 | -74.063422 | 64.0 | $117,600.67 |
| 919 | house | Bogotá D.C | 4.689000 | -74.050000 | 64.0 | $114,802.03 |
We can also sort the DataFrame by its index.
```python
df.set_index("lat").sort_index().head()
```

| lat | property_type | department | lon | area_m2 | price_cop |
|---|---|---|---|---|---|
| 0.001298 | apartment | Cundinamarca | 0.000463 | 140.0 | 8.200000e+08 |
| 0.001392 | apartment | Cundinamarca | 0.000139 | 124.0 | 2.300000e+08 |
| 0.001655 | apartment | Cundinamarca | -0.000050 | 120.0 | 5.900000e+08 |
| 3.138000 | house | Valle del Cauca | -76.593000 | 240.0 | 6.500000e+08 |
| 3.179440 | house | Cundinamarca | -75.638610 | 256.0 | 1.300000e+09 |
Practice
Sort the previous DataFrame by the price_usd column, showing only the head.
## Working with Columns
Sometimes, it's handy to duplicate a column of data. It might be that you'd like to drop some data points or erase empty cells while still preserving the original column. To duplicate a column, we place the name of the new column in square brackets and assign the existing column to it.
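As a quick sketch on a small made-up DataFrame, duplicating a column is a single assignment (the column names here are just for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({"price_usd": [330899.98, 121555.09]})

# Assigning an existing column to a new name creates an independent copy
df_demo["price_usd_backup"] = df_demo["price_usd"]
print(df_demo.columns.tolist())  # ['price_usd', 'price_usd_backup']
```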
### Adding Columns
For example, we might want to add a column of data that shows the price per square meter of each house in US dollars. To do that, we're going to need to create a new column and include the necessary math to populate it. First, we need to import the CSV and inspect the first few rows using the head method, like this:
```python
df3 = pd.read_csv("data/colombia-real-estate-3.csv")
df3.head(10)
```

|   | property_type | place_with_parent_names | lat-lon | area_m2 | price_usd |
|---|---|---|---|---|---|
| 0 | house | |Colombia|Bogotá D.C|Suba| | 4.722,-74.059 | 113.0 | 162073.45 |
| 1 | house | |Colombia|Valle del Cauca|Cali| | 3.455,-76.522 | 210.0 | 151943.86 |
| 2 | house | |Colombia|Bogotá D.C|Chapinero| | 4.676,-74.044 | 183.0 | 422066.30 |
| 3 | house | |Colombia|Atlántico|Barranquilla| | 10.999,-74.816 | 85.0 | 84413.26 |
| 4 | house | |Colombia|Valle del Cauca|Cali| | 3.334,-76.547 | 145.0 | 131577.65 |
| 5 | house | |Colombia|Bogotá D.C|Barrios Unidos| | 4.7,-74.054 | 100.0 | 162073.45 |
| 6 | house | |Colombia|Bogotá D.C|Suba| | 4.725,-74.068 | 260.0 | 523362.21 |
| 7 | house | |Colombia|Bogotá D.C|Suba| | 4.729,-74.07 | 118.0 | 167138.25 |
| 8 | house | |Colombia|Bogotá D.C|Suba| | 4.726,-74.068 | 68.0 | 84413.26 |
| 9 | house | |Colombia|Cundinamarca|La Mesa| | 4.623,-74.471 | 67.0 | 93192.23 |
Then, we create a new column called "price_m2", provide the formula to populate it, and inspect the first five rows of the dataset to make sure the new column includes the new values:
```python
df3["price_m2"] = df3["price_usd"] / df3["area_m2"]
df3.head()
```

Practice
Try it yourself! Add a column to the colombia-real-estate-2 dataset that shows the price per square meter of each house in Colombian pesos.
```python
df = ...
df["price_m2"] = ...
```

### Dropping Columns
Just like we can add columns, we can also take them away. To do this, we'll use the drop method. If we wanted to drop the "department" column from colombia-real-estate-1, the code would look like this:
```python
df2 = df.drop("department", axis="columns")
df2.head()
```

Note that we specified that we wanted to drop a column by setting the axis argument to "columns". We can drop rows from the dataset if we change the axis argument to "index". If we wanted to drop row 2, the code would look like this:
```python
df2 = df.drop(2, axis="index")
df2.head()
```

Practice
Try it yourself! Drop the "property_type" column and row 4 in the colombia-real-estate-2 dataset.
```python
df1 = ...
```

### Dropping Rows
Rows with empty cells can radically skew the results of our analysis, so we often drop them from the dataset. We can do this with the dropna method. If we wanted to do this with df, the code would look like this:
```python
print("df shape before dropping rows", df.shape)
df.dropna(inplace=True)
print("df shape after dropping rows", df.shape)
df.head()
```

By default, pandas keeps the original DataFrame and returns a modified copy. That's perfectly fine, but if we want to make sure that copies of the DataFrame aren't clogging up our computer's memory, we can intervene with the inplace argument. inplace=True means that we want the original DataFrame updated without making a copy. If we don't include inplace=True (or if we include inplace=False), then pandas reverts to the default.
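The two styles, side by side on a tiny made-up DataFrame with one missing value:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

# Default: dropna returns a modified copy; the original keeps all 3 rows
clean = df_demo.dropna()
print(len(df_demo), len(clean))  # 3 2

# inplace=True: the original DataFrame itself is updated
df_demo.dropna(inplace=True)
print(len(df_demo))  # 2
```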
Practice
Drop rows with empty cells from the colombia-real-estate-2 dataset.
```python
df2 = ...
```

### Splitting Strings
It can be useful to split strings into their constituent parts and create new columns to contain them. To do this, we'll use the .str.split method and include the character where we want the data to split apart. In the colombia-real-estate-3 dataset, we might be interested in breaking the "lat-lon" column into a "lat" column and a "lon" column. We'll split it at "," with code that looks like this:
```python
df3[["lat", "lon"]] = df3["lat-lon"].str.split(",", expand=True)
df3["lat-lon"].str.split(",", expand=True)
```

|   | 0 | 1 |
|---|---|---|
| 0 | 4.722 | -74.059 |
| 1 | 3.455 | -76.522 |
| 2 | 4.676 | -74.044 |
| 3 | 10.999 | -74.816 |
| 4 | 3.334 | -76.547 |
| ... | ... | ... |
| 3060 | 4.728 | -74.044 |
| 3061 | 4.682 | -74.056 |
| 3062 | 3.346 | -76.537 |
| 3063 | 4.7 | -74.028 |
| 3064 | 4.718 | -74.08 |
2956 rows × 2 columns
Here, expand is telling pandas to make the DataFrame bigger; that is, to create new columns without dropping any of the ones that already exist.
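For comparison, without expand=True the same split returns a single column where each row holds a Python list of the parts. A sketch on made-up coordinates:

```python
import pandas as pd

coords = pd.Series(["4.722,-74.059", "3.455,-76.522"])

# Without expand, each row becomes a list of the split parts
parts = coords.str.split(",")
print(parts[0])  # ['4.722', '-74.059']

# With expand=True, the parts become separate DataFrame columns
wide = coords.str.split(",", expand=True)
print(wide.shape)  # (2, 2)
```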
Practice
Try it yourself! In df3, split "place_with_parent_names" into three columns (one called "place", one called "department", and one called "state") using the character "|", and then return the new "department" column.
```python
df3.dropna(inplace=True)
df3[""] = df3["place_with_parent_names"].str.split("|", expand=True)[1]
```

```
0       Colombia
1       Colombia
2       Colombia
3       Colombia
4       Colombia
          ...
3060    Colombia
3061    Colombia
3062    Colombia
3063    Colombia
3064    Colombia
Name: 1, Length: 2956, dtype: object
```

### Recasting Data
Depending on who formatted your dataset, the types of data assigned to each column might need to be changed. If, for example, a column containing only numbers had been mistaken for a column containing only strings, we’d need to change that through a process called recasting. Using the colombia-real-estate-1 dataset, we could recast the entire dataset as strings by using the astype method, like this:
```python
print(df.info())
newdf = df.astype("str")
print(newdf.info())
```

This is a useful approach but, more often than not, you'll want to recast only individual columns. In the colombia-real-estate-1 dataset, the "area_m2" column is cast as float64. Let's change it to int. We'll still use the `astype` method, but apply it to just that column. The code looks like this:
```python
df["area_m2"] = df["area_m2"].astype(int)
df.info()
```

Practice
Try it yourself! In the colombia-real-estate-2 dataset, recast "price_cop" as an object.
```python
df2 = ...
df2["price_cop"] = ...
df2.info()
```

Access a substring in a Series¶
To access a substring from a Series, use the `.str` attribute of the Series. Then, index each string in the Series by providing start:stop:step. Keep in mind that the start position is inclusive and the stop position is exclusive, meaning the value at the start index is included but the value at the stop index is not. Also, Python is a 0-indexed language, so the first element in the substring is at index position 0. For example, using the colombia-real-estate-1 dataset, we could access the values at index positions 0, 2, and 4 of the "department" column:
```python
df["department"].str[0:5:2]
```

```
0       Bgt
1       Bgt
2       Qid
3       Bgt
4       Aln
       ...
3061    Bgt
3062    Blv
3063    Cni
3064    Bgt
3065    Bgt
Name: department, Length: 3066, dtype: object
```

Practice: Access a substring in a Series using pandas
Try it yourself! In the colombia-real-estate-2 dataset, access the property_type column and return the first 5 characters from each row:
Replacing String Characters¶
Another change you might want to make is replacing the characters in a string. To do this, we’ll use the replace method again, being sure to specify which string should be replaced, and what new string should replace it. For example, if we wanted to replace the string “house” with the string “single_family” in the colombia-real-estate-1 dataset, the code would look like this:
```python
df["property_type"] = df["property_type"].str.replace("house", "single_family")
df.head()
```

Note that the old value needs to come before the new value inside the parentheses of `str.replace`.
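The argument order can be sketched on a toy Series (values invented for illustration):

```python
import pandas as pd

# Toy property-type Series
s = pd.Series(["house", "apartment", "house"])

# str.replace(old, new): the value being replaced comes first
print(s.str.replace("house", "single_family").tolist())
# ['single_family', 'apartment', 'single_family']
```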
Practice
Try it yourself! In the colombia-real-estate-2 dataset, change “apartment” to “multi_family” and print the result.
```python
df2 = ...
```

Rename a Series¶
Another change you might want to make is to rename a Series in pandas. To do this, we'll use the `rename` method, being sure to specify the mapping of old to new column names. For example, if we wanted to replace the column name property_type with the string type_property in the colombia-real-estate-1 dataset, the code would look like this:
```python
df.rename(columns={"property_type": "type_property"})
```

Practice: Rename a Series
Try it yourself! In the colombia-real-estate-2 dataset, change the column lat to latitude and print the head of the DataFrame.
Determine the unique values in a column¶
You might be interested in the unique values in a Series using pandas. To do this, we’ll use the unique method. For example, if we wanted to identify the unique values in the column property_type in the colombia-real-estate-1 dataset, the code would look like this:
```python
df["property_type"].unique()
```

```
array(['house', 'apartment'], dtype=object)
```
Practice: Determine the unique values in a column
Try it yourself! In the colombia-real-estate-2 dataset, identify the unique values in the column department:
Replacing Column Values¶
If you want to replace a column's values, simply use the `.replace()` method:
```python
df = pd.read_csv("data/colombia-real-estate-2.csv")
df.head()
```

| | property_type | department | lat | lon | area_m2 | price_cop |
|---|---|---|---|---|---|---|
| 0 | house | Magdalena | 11.233 | -74.204 | 235.0 | 4.000000e+08 |
| 1 | house | Bogotá D.C | 4.714 | -74.030 | 130.0 | 8.500000e+08 |
| 2 | house | Cundinamarca | 4.851 | -74.059 | 137.0 | 4.750000e+08 |
| 3 | house | Atlántico | 11.006 | -74.808 | 346.0 | 1.400000e+09 |
| 4 | house | Cundinamarca | 4.857 | -74.061 | 175.0 | 4.300000e+08 |
We can replace a specific value with another:
```python
df["area_m2"].replace(235.0, 0)
```

If you want to replace multiple values at the same time, you can define a dictionary ahead of time, with the original values as keys and the replacement values as values. Then pass the dictionary to the `replace()` method.
```python
replace_value = {235: 0, 130: 1, 137: 2}
df["area_m2"].replace(replace_value)
```

We can also apply an operation to a whole column. In the following example, we change the price_cop unit to millions.
```python
df["price_cop"] = df["price_cop"] / 1e6
df.head()
```

Practice: Replace Column Values
Try it yourself! Define a dictionary to replace values in price_cop. Replace 400 with 0 and 850 with 1.
```python
replace_value = ...
# Replace values
```

Concatenating¶
When we concatenate data, we're combining two or more separate sets of data into a single large dataset.
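The idea can be sketched with two toy DataFrames (the rows below are invented for illustration):

```python
import pandas as pd

# Two small DataFrames with the same columns
df_a = pd.DataFrame({"department": ["Bogotá D.C"], "area_m2": [120]})
df_b = pd.DataFrame({"department": ["Atlántico"], "area_m2": [250]})

# pd.concat stacks them vertically; rows keep their original index labels
combined = pd.concat([df_a, df_b])
print(combined.shape)  # (2, 2)

# ignore_index=True renumbers the rows 0..n-1 if you want a fresh index
combined = pd.concat([df_a, df_b], ignore_index=True)
```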
Concatenating DataFrames¶
If we want to combine two DataFrames, we need to import pandas and read in our data.
```python
import pandas as pd

df1 = pd.read_csv("data/colombia-real-estate-1.csv")
df2 = pd.read_csv("data/colombia-real-estate-2.csv")
print("df1 shape:", df1.shape)
print("df2 shape:", df2.shape)
```

```
df1 shape: (3066, 6)
df2 shape: (3066, 6)
```
Next, we'll use the `concat` method to put our DataFrames together, using each DataFrame's name in a list.
```python
concat_df = pd.concat([df1, df2])
print("concat_df shape:", concat_df.shape)
concat_df.head()
```

```
concat_df shape: (6132, 7)
```
| | property_type | department | lat | lon | area_m2 | price_usd | price_cop |
|---|---|---|---|---|---|---|---|
| 0 | house | Bogotá D.C | 4.690 | -74.048 | 187.0 | $330,899.98 | NaN |
| 1 | house | Bogotá D.C | 4.695 | -74.082 | 82.0 | $121,555.09 | NaN |
| 2 | house | Quindío | 4.535 | -75.676 | 235.0 | $219,474.47 | NaN |
| 3 | house | Bogotá D.C | 4.620 | -74.129 | 195.0 | $97,919.38 | NaN |
| 4 | house | Atlántico | 11.012 | -74.834 | 112.0 | $115,477.34 | NaN |
Practice
Try it yourself! Create two DataFrames from colombia-real-estate-2.csv and colombia-real-estate-3.csv, and concatenate them as the DataFrame concat_df.
```python
df2 = ...
df3 = ...
concat_df = ...
concat_df.head()
```

Concatenating Series¶
We can also concatenate Series using a similar set of commands. First, let's take one Series each from df1 and df2.
```python
df1 = pd.read_csv("data/colombia-real-estate-1.csv")
df2 = pd.read_csv("data/colombia-real-estate-2.csv")
sr1 = df1["property_type"]
sr2 = df2["property_type"]
print("len sr1:", len(sr1))
print(sr1.head())
print()
print("len sr2:", len(sr2))
print(sr2.head())
```

```
len sr1: 3066
0    house
1    house
2    house
3    house
4    house
Name: property_type, dtype: object

len sr2: 3066
0    house
1    house
2    house
3    house
4    house
Name: property_type, dtype: object
```
Now that we have two Series, let's put them together.
```python
concat_sr = pd.concat([sr1, sr2])
print("len concat_sr:", len(concat_sr))
print(concat_sr.head())
```

```
len concat_sr: 6132
0    house
1    house
2    house
3    house
4    house
Name: property_type, dtype: object
```
Practice
Try it yourself! Use the colombia-real-estate-2 and colombia-real-estate-3 datasets to create a concatenated Series for the area_m2 column, and print the result.
```python
df2 = ...
```

Saving a DataFrame as a CSV¶
Once you’ve cleaned all your data and gotten the DataFrame to show everything you want it to show, it’s time to save the DataFrame as a new CSV file using the to_csv method. First, let's load up the colombia-real-estate-1 dataset, and use head to see the first five rows of data:
```python
import pandas as pd

df = pd.read_csv("data/colombia-real-estate-1.csv")
df.head()
```

Maybe we're only interested in those first five rows, so let's save them as their own new CSV file using the `to_csv` method. Note that we're setting the index argument to False so that the DataFrame index isn't included in the CSV file.
```python
df = df.head()
df.to_csv("data/small-df.csv", index=False)
```

References & Further Reading¶
- Tutorial for `shape`
- Tutorial for `info`
- Adding columns to a DataFrame
- Creating a DataFrame from a dictionary
- Working with JSON
- Dropping columns from a DataFrame
- Splitting columns in a DataFrame
- Recasting values
- Replacing strings
- Concatenating DataFrames
- From DataFrames to Series
- Stack Overflow: What is serialization?
- Understand Python Pickling
Copyright © 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
<font size="+3"><strong>Pandas: Advanced</strong></font>
Calculate Summary Statistics for a DataFrame or Series¶
Many datasets are large, and it can be helpful to get a summary of information in them. Let's load a dataset and examine the first few rows:
```python
import pandas as pd

mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")
mexico_city1.head()
```

| | operation | property_type | place_with_parent_names | lat-lon | price | currency | price_aprox_local_currency | price_aprox_usd | surface_total_in_m2 | surface_covered_in_m2 | price_usd_per_m2 | price_per_m2 | floor | rooms | expenses | properati_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sell | apartment | |México|Distrito Federal|Álvaro Obregón| | NaN | 35000000.0 | MXN | 35634500.02 | 1894595.53 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | http://alvaro-obregon.properati.com.mx/2eb_ven... |
| 1 | sell | apartment | |México|Distrito Federal|Benito Juárez| | NaN | 2000000.0 | MXN | 2036257.11 | 108262.60 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | http://benito-juarez.properati.com.mx/2ec_vent... |
| 2 | sell | apartment | |México|Distrito Federal|Cuauhtémoc| | 19.41501,-99.175174 | 2700000.0 | MXN | 2748947.10 | 146154.51 | 61.0 | 61.0 | 2395.975574 | 44262.295082 | NaN | 3.0 | NaN | http://cuauhtemoc.properati.com.mx/2pu_venta_a... |
| 3 | sell | apartment | |México|Distrito Federal|Cuauhtémoc| | 19.41501,-99.175174 | 6347000.0 | MXN | 6462061.92 | 343571.36 | 176.0 | 128.0 | 1952.110000 | 49585.937500 | NaN | 5.0 | NaN | http://cuauhtemoc.properati.com.mx/2pv_venta_a... |
| 4 | sell | apartment | |México|Distrito Federal|Álvaro Obregón| | NaN | 6870000.0 | MXN | 6994543.16 | 371882.03 | 180.0 | 136.0 | 2066.011278 | 50514.705882 | NaN | 5.0 | NaN | http://alvaro-obregon.properati.com.mx/2pw_ven... |
Let's get a summary description of this dataset.
```python
mexico_city1.describe()
```

| | price | price_aprox_local_currency | price_aprox_usd | surface_total_in_m2 | surface_covered_in_m2 | price_usd_per_m2 | price_per_m2 | floor | rooms | expenses |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 3.758000e+03 | 3.758000e+03 | 3.758000e+03 | 2850.000000 | 3662.000000 | 2075.000000 | 3561.000000 | 575.000000 | 202.000000 | 0.0 |
| mean | 5.015710e+06 | 9.553774e+06 | 5.079498e+05 | 292.954386 | 250.817313 | 1906.810070 | 24475.917409 | 3.627826 | 3.039604 | NaN |
| std | 7.355006e+06 | 1.544553e+07 | 8.211993e+05 | 1816.296990 | 336.555251 | 1813.415255 | 27430.941809 | 22.425889 | 1.410132 | NaN |
| min | 1.485000e+05 | 1.471566e+05 | 7.823940e+03 | 0.000000 | 0.000000 | 1.034212 | 98.221416 | 1.000000 | 1.000000 | NaN |
| 25% | 1.068000e+06 | 1.585902e+06 | 8.431836e+04 | 0.000000 | 70.000000 | 669.510272 | 8684.210526 | 2.000000 | 2.000000 | NaN |
| 50% | 2.550000e+06 | 4.031164e+06 | 2.143267e+05 | 100.000000 | 146.500000 | 1415.132810 | 17460.317460 | 2.000000 | 3.000000 | NaN |
| 75% | 5.833201e+06 | 1.228003e+07 | 6.528979e+05 | 275.750000 | 310.000000 | 2498.628385 | 35810.810811 | 3.000000 | 4.000000 | NaN |
| max | 1.250000e+08 | 3.364841e+08 | 1.789000e+07 | 84500.000000 | 8265.000000 | 28396.825397 | 750000.000000 | 450.000000 | 9.000000 | NaN |
Like most large datasets, this one has many values which are missing. The `describe` function will ignore missing values in each column. You can also remove rows and columns with missing values, and then get a summary of the data that's still there. We need to remove columns first, before removing the rows; the sequence of operations here is important. The code looks like this:
```python
mexico_city1 = mexico_city1.drop(
    ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1
)
mexico_city1 = mexico_city1.dropna(axis=0)
mexico_city1.head()
```

| | operation | property_type | place_with_parent_names | lat-lon | price | currency | price_aprox_local_currency | price_aprox_usd | surface_total_in_m2 | surface_covered_in_m2 | price_per_m2 | properati_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | sell | apartment | |México|Distrito Federal|Cuauhtémoc| | 19.41501,-99.175174 | 2700000.0 | MXN | 2748947.10 | 146154.51 | 61.0 | 61.0 | 44262.295082 | http://cuauhtemoc.properati.com.mx/2pu_venta_a... |
| 3 | sell | apartment | |México|Distrito Federal|Cuauhtémoc| | 19.41501,-99.175174 | 6347000.0 | MXN | 6462061.92 | 343571.36 | 176.0 | 128.0 | 49585.937500 | http://cuauhtemoc.properati.com.mx/2pv_venta_a... |
| 6 | sell | apartment | |México|Distrito Federal|Miguel Hidalgo| | 19.456564,-99.191724 | 670000.0 | MXN | 682146.11 | 36267.97 | 65.0 | 65.0 | 10307.692308 | http://miguel-hidalgo-df.properati.com.mx/46h_... |
| 7 | sell | apartment | |México|Distrito Federal|Gustavo A. Madero| | 19.512787,-99.141393 | 1400000.0 | MXN | 1425379.97 | 75783.82 | 82.0 | 70.0 | 20000.000000 | http://gustavo-a-madero.properati.com.mx/46p_v... |
| 8 | sell | house | |México|Distrito Federal|Álvaro Obregón| | 19.358776,-99.213557 | 6680000.0 | MXN | 6801098.67 | 361597.08 | 346.0 | 346.0 | 19306.358382 | http://alvaro-obregon.properati.com.mx/46t_ven... |
Let's take a look at our new, cleaner dataset.
```python
mexico_city1.describe()
```

| | price | price_aprox_local_currency | price_aprox_usd | surface_total_in_m2 | surface_covered_in_m2 | price_per_m2 |
|---|---|---|---|---|---|---|
| count | 1.930000e+03 | 1.930000e+03 | 1.930000e+03 | 1930.000000 | 1930.000000 | 1930.000000 |
| mean | 5.090727e+06 | 9.808822e+06 | 5.215101e+05 | 311.264249 | 274.844560 | 23817.307978 |
| std | 7.254611e+06 | 1.580064e+07 | 8.400795e+05 | 2170.418906 | 357.800175 | 26835.969068 |
| min | 1.485000e+05 | 1.471566e+05 | 7.823940e+03 | 0.000000 | 2.000000 | 98.221416 |
| 25% | 1.064276e+06 | 1.615638e+06 | 8.589933e+04 | 0.000000 | 71.000000 | 8695.652174 |
| 50% | 2.700231e+06 | 3.915195e+06 | 2.081610e+05 | 90.000000 | 157.000000 | 17303.420882 |
| 75% | 5.906450e+06 | 1.228003e+07 | 6.528979e+05 | 280.000000 | 350.000000 | 35779.954128 |
| max | 1.250000e+08 | 2.351062e+08 | 1.250000e+07 | 84500.000000 | 8265.000000 | 750000.000000 |
Practice
Reload the mexico-city-real-estate-1.csv dataset. Reverse the sequence of operations by first dropping all rows where there is a missing value, and then dropping the columns floor, price_usd_per_m2, expenses, and rooms. What is the size of the resulting DataFrame?
```python
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")
mexico_city1 = ...
mexico_city1 = ...
print(mexico_city1.shape)
```
Select a Series from a DataFrame¶
Since the datasets we work with are so large, you might want to focus on a single column of a DataFrame. Let's load up the `mexico-city-real-estate-2` dataset, and examine the first few rows to find the column names.
```python
mexico_city2 = pd.read_csv("./data/mexico-city-real-estate-2.csv")
mexico_city2.head()
```

Maybe we're interested in the `surface_covered_in_m2` column. The code to extract just that one column looks like this:
```python
surface_covered_in_m2 = mexico_city2["surface_covered_in_m2"]
surface_covered_in_m2
```

Practice
Select the price Series from the mexico_city2 DataFrame, which was loaded from the mexico-city-real-estate-2 dataset.
```python
price = ...
print(price)
```

Subset a DataFrame by Selecting One or More Columns¶
You may find it more efficient to work with a smaller portion of a dataset that's relevant to you. One way to do this is to select some columns from a DataFrame and make a new DataFrame with them. Let's load a dataset to do this and examine the first few rows to find the column headings:
```python
mexico_city4 = pd.read_csv("./data/mexico-city-real-estate-4.csv")
mexico_city4.head()
```

Let's choose `"operation"`, `"property_type"`, `"place_with_parent_names"`, and `"price"`:
```python
mexico_city4_subset = mexico_city4[
    ["operation", "property_type", "place_with_parent_names", "price"]
]
mexico_city4_subset
```

Once we've done that, we can find the resulting number of entries in the DataFrame:
```python
mexico_city4_subset.shape
```

Practice
Load the mexico-city-real-estate-1.csv dataset and subset it to obtain the operation, lat-lon, and place_with_parent_names columns only. How many entries are in the resulting DataFrame?
```python
mexico_city1 = ...
mexico_city1_subset = ...
print(mexico_city1_subset.shape)
```

Subset the Columns of a DataFrame Based on Data Types¶
It's helpful to be able to find specific types of entries — typically numeric ones — and put all of these in a separate DataFrame. First, let's take a look at the `mexico-city-real-estate-5` dataset.
```python
mexico_city5 = pd.read_csv("./data/mexico-city-real-estate-5.csv")
mexico_city5.head()
```

The code to subset just the numerical entries looks like this:
```python
mexico_city5_numbers = mexico_city5.select_dtypes(include="number")
mexico_city5_numbers.head()
```

Practice
Create a subset of the DataFrame from mexico-city-real-estate-5 which excludes numbers.
```python
mexico_city3 = mexico_city5
mexico_city3_no_numbers = mexico_city3.select_dtypes(exclude="number")
print(mexico_city3_no_numbers.info())
```

Working with value_counts in a Series¶
In order to use the data in a Series for other types of analysis, it might be helpful to know how often each value occurs in the Series. To do that, we use the `value_counts` method to aggregate the data. Let's take a look at the number of properties associated with each department in the colombia-real-estate-1 dataset.
```python
df1 = pd.read_csv("data/colombia-real-estate-1.csv", usecols=["department"])
df1["department"].value_counts()
```

Practice
Try it yourself! Aggregate the different property types in the colombia-real-estate-2 dataset.
```python
df2 = ...
```

Series and Groupby¶
Large Series often include data points that have some attribute in common, but which are nevertheless not grouped together in the dataset. Happily, pandas has a method that will bring these data points together into groups.
Let's take a look at the colombia-real-estate-1 dataset. The set includes properties scattered across Colombia, so it might be useful to group properties from the same department together; to do this, we'll use the groupby method. The code looks like this:
```python
dept_group = df1.groupby("department")
df1
```

To make sure we got all the departments in the dataset, let's print the first occurrence of each department.
```python
dept_group.first()
```

Practice
Try it yourself! Group the properties in colombia-real-estate-2 by department, and print the result.
```python
df2 = pd.read_csv("data/colombia-real-estate-2.csv", usecols=["department"])
dept_group = df2.groupby("department")
dept_group.first()
```

Now that we have all the properties grouped by department, we might want to see the properties in just one of the departments. We can use the `get_group` method to do that. If we just wanted to see the properties in `"Santander"`, for example, the code would look like this:
```python
dept_group1 = df1.groupby("department")
dept_group1.get_group("Santander")
```

We can also make groups based on more than one category by adding them to the `groupby` method. After resetting the `df1` DataFrame, here's what the code looks like if we want to group properties both by `department` and by `property_type`.
```python
df1 = pd.read_csv("data/colombia-real-estate-1.csv")
dept_group2 = df1.groupby(["department", "property_type"])
dept_group2.first()
```

Practice
Try it yourself! Group the properties in colombia-real-estate-2 by department and property type, and print the result.
```python
dept_group = ...
dept_group.first()
```

Finally, it's possible to use `groupby` to calculate aggregations. For example, if we wanted to find the average property area in each department, we would use the `.mean()` method. This is what the code for that looks like:
```python
dept_group = df1.groupby("department")["area_m2"].mean()
dept_group
```

Practice
Try it yourself! Use the .mean method in the colombia-real-estate-2 dataset to find the average price in Colombian pesos ("price_cop") for properties in each "department".
```python
dept_group = ...
dept_group
```

Subset a DataFrame's Columns Based on the Column Data Types¶
It's helpful to be able to find entries of a certain type, typically numerical entries, and put all of these in a separate DataFrame. Let's load a dataset to see how this works:
```python
mexico_city5 = pd.read_csv("./data/mexico-city-real-estate-5.csv")
mexico_city5.head()
```

Now let's get only numerical entries:
```python
mexico_city5_numbers = mexico_city5.select_dtypes(include="number")
mexico_city5_numbers.head()
```

Let's now find all entries which are not numerical:
```python
mexico_city5_no_numbers = mexico_city5.select_dtypes(exclude="number")
mexico_city5_no_numbers.head()
```

Practice
Create a subset of the DataFrame from mexico-city-real-estate-5.csv which excludes numbers. How many entries does it have?
```python
mexico_city3 = ...
mexico_city3_no_numbers = ...
print(mexico_city3_no_numbers.shape)
```

Subset a DataFrame's columns based on column names¶
To subset a DataFrame by column names, either define a list of columns to include or define a list of columns to exclude. Once you've done that, you can retain or drop the columns accordingly. For example, let's suppose we want to modify the `mexico_city3` dataset and only retain the first three columns. Let's define two lists, one with the columns to retain and one with the columns to drop:
```python
drop_cols = [
    "lat-lon",
    "price",
    "currency",
    "price_aprox_local_currency",
    "price_aprox_usd",
    "surface_total_in_m2",
    "surface_covered_in_m2",
    "price_usd_per_m2",
    "price_per_m2",
    "floor",
    "rooms",
    "expenses",
    "properati_url",
]
keep_cols = ["operation", "property_type", "place_with_parent_names"]
```

Next, we'll explore both approaches to subset `mexico_city3`. To retain columns based on `keep_cols`:
```python
mexico_city3_subsetted = mexico_city3[keep_cols]
```

To drop columns in `drop_cols`:
```python
mexico_city3_subsetted = mexico_city3.drop(columns=drop_cols)
```

Practice
Create a subset of the DataFrame from mexico-city-real-estate-3.csv which excludes the last two columns.
Pivot Tables¶
A pivot table allows us to aggregate and summarize a DataFrame across multiple variables. For example, let's suppose we wanted to calculate the mean of the price column in the mexico_city3 dataset for the different values in the property_type column:
```python
import numpy as np

mexico_city3_pivot = ...
mexico_city3_pivot
```

Subsetting with Masks¶
Another way to create subsets from a larger dataset is through masking. Masks are ways to filter out the data you're not interested in so that you can focus on the data you are. For example, we might want to look at properties in Colombia that are bigger than 200 square meters. In order to create this subset, we'll need to use a mask.
First, we'll reset our df1 DataFrame so that we can draw on all the data in its original form. Then we'll write a conditional statement and assign the result to mask.
```python
import pandas as pd

df1 = pd.read_csv("data/colombia-real-estate-1.csv")
mask = df1["area_m2"] > 200
mask.head()
```

Notice that `mask` is a Series of Boolean values. Where properties are smaller than 200 square meters, our statement evaluates to `False`; where they're bigger than 200, it evaluates to `True`.
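The Boolean Series and the rows it selects can be sketched on a handful of made-up areas:

```python
import pandas as pd

# Toy areas (invented values, not the course dataset)
df = pd.DataFrame({"area_m2": [150, 250, 80, 300]})

# The comparison produces one Boolean per row
mask = df["area_m2"] > 200
print(mask.tolist())  # [False, True, False, True]

# Indexing the DataFrame with the mask keeps only the True rows
print(df[mask]["area_m2"].tolist())  # [250, 300]
```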
Once we have our mask, we can use it to select all the rows from df1 that evaluate as True.
```python
df1[mask].head()
```

Practice
Try it yourself! Read colombia-real-estate-2 into a DataFrame named df2, and create a mask to select all the properties that are smaller than 100 square meters.
```python
import pandas as pd

df2 = pd.read_csv("data/colombia-real-estate-2.csv")
mask = df2["area_m2"] < 100
df2[mask].head()
```

We can also create masks with multiple criteria using `&` for "and" and `|` for "or." For example, here's how we would find all properties in Atlántico that are bigger than 400 square meters.
```python
mask = (df1["area_m2"] > 400) & (df1["department"] == "Atlántico")
df1[mask].head()
```

Practice
Try it yourself! Create a mask for df2 to select all the properties in Tolima that are smaller than 150 square meters.
```python
mask = ...
df2[mask].head()
```

Reshape a DataFrame based on column values¶
What's a pivot table?¶
A pivot table allows you to quickly aggregate and summarize a DataFrame using an aggregation function. For example, to build a pivot table that summarizes the mean of the price_cop column for each of the unique categories in the property_type column in df2:
```python
import numpy as np

pivot1 = pd.pivot_table(df2, values="price_cop", index="property_type", aggfunc=np.mean)
pivot1
```

Practice
Build a pivot table that summarizes the mean of the price_cop column for each of the unique departments in the department column in df2:
```python
pivot2 = pd.pivot_table(df2, values="price_cop", index="department", aggfunc=np.mean)
pivot2
```

Combine multiple categories in a Series¶
Categorical variables can be collapsed into fewer categories. One approach is to retain the most frequently observed values and collapse all remaining values into a single category. For example, to retain only the values of the top 10 most frequent categories in the `department` column and then collapse the other categories together, use `value_counts` to generate the count of the values:
```python
df2["department"].value_counts()
```

Next, select just the top 10 using the `head()` method, and select the column names by using the `index` attribute of the Series:
```python
top_10 = df2["department"].value_counts().head(10).index
top_10
```

Finally, use the `apply` method and a lambda function to keep only the values from the `department` column that are in the top 10 and collapse the remaining values into the value `"Other"`:
```python
df2["department"] = df2["department"].apply(lambda x: x if x in top_10 else "Other")
print(df2["department"].head(20))
```

<font size="+1">Practice</font>
Try it yourself! Retain the top 5 most frequent values remaining in the `department` column and collapse the rest into the category `"Other"`.
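For reference, here is a sketch of one possible approach. It mirrors the top-10 example above, but uses a small synthetic Series with made-up category labels rather than the real `department` column:

```python
import pandas as pd

# Hypothetical stand-in for a categorical column like df2["department"]
department = pd.Series(
    ["A", "A", "A", "B", "B", "C", "C", "D", "D", "E", "E", "F", "G", "H"]
)

# Keep the 5 most frequent categories and collapse the rest into "Other"
top_5 = department.value_counts().head(5).index
collapsed = department.apply(lambda x: x if x in top_5 else "Other")
print(collapsed.value_counts())
```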
# Cross Tabulation
The pandas `crosstab` function is useful for working with grouped summary statistics for **categorical** data. You start by picking two categorical columns, defining one as the index and the other as the columns. If no aggregate function and value column are specified, `crosstab` simply calculates the frequency of each combination by default. Let's look at an example from the Colombia real estate dataset.
```python
import pandas as pd

df = pd.read_csv("data/colombia-real-estate-1.csv")
df.head()
```

The following call calculates the frequency of each combination of two variables, `department` and `property_type`, in the dataset.
```python
pd.crosstab(index=df["department"], columns=df["property_type"])
```

From the previous example, you can see we have created a DataFrame whose index shows the unique values of the variable `department`, while the columns are the unique values of the variable `property_type`. Each cell shows how many data points there are for each combination of `department` and `property_type`. For example, there are 8 apartments in the department `Antioquia`.
We can further specify a value column and an aggregate function, as in `pivot_table`, to carry out more complicated calculations for the two categorical variables. In the following example, we look at the average area for the different property types in each department in Colombia.
```python
import numpy as np

pd.crosstab(
    index=df["department"],
    columns=df["property_type"],
    values=df["area_m2"],
    aggfunc=np.mean,
).round(0)
```

<font size="+1">Practice</font>
Create a cross tabulation calculating the frequency of combinations in `mexico-city-real-estate-3.csv`, using `currency` as the index and `property_type` as the columns.
```python
# Read dataset
mexico_city3 = ...

# Create `crosstab`
```

# Applying Functions to DataFrames and Series
`apply` is a useful method for applying one function to all the rows or columns of a DataFrame efficiently. Let's take the following real estate dataset as an example:
```python
# Read data, only use the numerical columns
df = pd.read_csv("data/colombia-real-estate-2.csv", usecols=["area_m2", "price_cop"])
df.head()
```

By specifying the function inside `apply()`, we can transform the whole DataFrame. For example, here we calculate the square root of every value in each column:
```python
import numpy as np

df.apply(np.sqrt)
```

Note you can also specify the `axis` inside `apply`. By default, `axis=0`, which means the function is applied to each column. We can switch to `axis=1` if we want to apply the function to each row. The following example shows the sum of all rows for each column:
```python
df.apply(np.sum)
```

The following code will get the sum of all columns for each row:
```python
df.apply(np.sum, axis=1)
```

By specifying the column name, we can also apply the function to a specific column or columns. Note that we can also specify the index (row labels) to apply functions only to specific rows; however, that is not common in practice.
```python
df["area_m2"].apply(np.sqrt)
```

We can assign the results to a new column:
```python
df["area_sqrt"] = df["area_m2"].apply(np.sqrt)
df
```

<font size="+1">Practice</font>
Try it yourself! Create a new column named `sum_columns`, which is the sum of all numerical columns in `df`:
```python
df["sum_columns"] = df.apply(np.sum, axis=1)
df.head()
```

`df.aggregate()`, or `df.agg()`, shares the same concept as `df.apply()` in terms of applying functions to a DataFrame, but `df.aggregate()` can only apply aggregate functions like `sum`, `mean`, `min`, `max`, etc. See the following example for more details:
```python
df = pd.read_csv("data/colombia-real-estate-2.csv", usecols=["area_m2", "price_cop"])
df.head()
```

We can check the minimum value of each column by calling the `min` aggregate function:
```python
df.agg("min")
```

Like `apply()`, we can also specify the `axis` argument to switch axes:
```python
df.agg("min", axis=1)
```

We can apply an aggregate function to a whole DataFrame using `df.agg()`, or specify a column name for a subset of the DataFrame:
```python
df["area_m2"].agg("min")
```

We can also apply a list of aggregate functions to a DataFrame:
```python
df.agg(["sum", "max"])
```

The syntax above calculates both `sum` and `max` for each column, storing the function names as the index. We can also apply different aggregate functions to different columns. In that case, we pass a dictionary whose keys are column names and whose values are the corresponding aggregate function names:
```python
df.agg({"area_m2": "sum", "price_cop": "min"})
```

<font size="+1">Practice</font>
Try it yourself! Find the minimum for column `area_m2` and the maximum for column `price_cop` using `df.agg()`:
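A possible solution sketch, demonstrated on a tiny synthetic DataFrame (the column values below are made up; the real exercise uses `df`):

```python
import pandas as pd

# Hypothetical stand-in for the real estate DataFrame
df_demo = pd.DataFrame(
    {"area_m2": [82.0, 187.0, 235.0], "price_cop": [3.1e8, 4.5e8, 2.9e8]}
)

# A dict maps each column to its own aggregate function
result = df_demo.agg({"area_m2": "min", "price_cop": "max"})
print(result)
```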
# Working with Dates
## Time Stamps
Pandas' time series capabilities are built on the `Timestamp` class. For instance, we can transform date strings of different formats into `Timestamp`s:
```python
print(pd.Timestamp("January 8, 2022"))
print(pd.Timestamp("01/08/22 20:13"))
```

We can also do date math with `Timestamp`s. Note that when we subtract two dates, we get a special object called a `Timedelta`.
```python
time_delta = pd.Timestamp("Feb. 11 2016 2:30 am") - pd.Timestamp("2015-08-03 5:14 pm")
print("time_delta type:", type(time_delta))
print(time_delta)
```

The pandas `Timestamp` class is also compatible with Python's `datetime` module.
```python
import datetime

pd.Timestamp("Feb 11, 2016") - datetime.datetime(2015, 8, 3)
```

We can manipulate dates using offset objects from `pandas.tseries.offsets`. For example, the `Day()` offset adds or subtracts days from a `Timestamp`. The following code calculates the date four days prior to 9 January 2017:
```python
import pandas as pd
from pandas.tseries.offsets import BDay, BMonthEnd, Day

print(pd.Timestamp("January 9, 2017") - Day(4))
```

In some cases, you might need to add or subtract only business days. That's when you use `BDay()`:
```python
print(pd.Timestamp("January 9, 2017") - BDay(4))
```

We can also do an offset at the monthly level. For example, you can use `BMonthEnd(n)` to find the last business day $n$ months later. The following code shows the last business day of January 2017:
```python
print(pd.Timestamp("January 9, 2017") + BMonthEnd(0))
```

The following code shows the last business day of May 2017, reached by rolling forward four business month-ends from 9 February:
```python
print(pd.Timestamp("February 9, 2017") + BMonthEnd(4))
```

We can also use the `strptime` function from `datetime` to transform a string into a date:
```python
from datetime import datetime

date = datetime.strptime("16 Oct 2022 12:14:05", "%d %b %Y %H:%M:%S")
date
```

We can further transform it into the [ISO 8601 format](https://en.wikipedia.org/wiki/ISO_8601):
```python
date.isoformat()
```

## Date Time Indices
`DatetimeIndex` is a convenient way to transform date-like data into a proper datetime index for a DataFrame. That way, we can better plot and model time series data. Let's look at an example. Here we have Apple's stock prices from 1999 to 2022.
```python
data = pd.read_csv("data/AAPL.csv")
data.set_index("date", inplace=True)
print(data.info())
data.head()
```

Even though the index looks like *dates*, it is not in *date format*. So when we plot the data, the index doesn't follow the sequence of years, and the x-axis ticks are hard to read.
```python
data["open"].plot()
```

You can see the index is not in date format.
```python
data.index[:5]
```

We can use `DatetimeIndex` to transform it into dates:
```python
pd.DatetimeIndex(data.index)
```

And we can set it as the index:
```python
data.index = pd.DatetimeIndex(data.index)
data
```

Now we can see the benefit when plotting:
```python
data["open"].plot()
```

## Date Ranges
If we're entering time series data into a DataFrame, it will often be useful to create a range of dates using `date_range`. We can create ranges with different frequencies by specifying `freq`. Here are the days in a specific range:
```python
pd.date_range(start="1/8/2022", end="3/2/2022", freq="D")
```

Here are the business days for a specific range:
```python
pd.date_range(start="1/8/2022", end="3/2/2022", freq="B")
```

# References & Further Reading
- Pandas Documentation on Dropping a Column in a DataFrame
- Pandas Documentation on Printing out the First Few Rows of a DataFrame
- Pandas Documentation on Reading a CSV File Into a DataFrame
- Getting Started with Pandas Documentation
- Pandas Documentation on Extracting a Column to a Series
- Pandas Official Indexing Guide
- Series in pandas
- Tutorial for `value_counts`
- Tutorial for `groupby`
- pandas Documentation for `groupby`
- Online Examples of Selecting Numeric Columns of a DataFrame
- Official Pandas Documentation on Data Types in a DataFrame
- Pandas Documentation for Boolean Indexing
- More information on creating various kinds of subsets
- More on Boolean indexing
- A tutorial on using various criteria to create subsets
- `pandas.DataFrame.apply`
- `pandas.DataFrame.aggregate`
---

Copyright © 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
<font size="+3"><strong>Pandas: Descriptive Statistics</strong></font>
# Descriptive Statistics
*Descriptive statistics* are used to describe the basic features of a dataset.
## Quartiles
Quartiles divide a sequence of numbers into four equal parts. Grouping a dataset into quartiles helps us find outliers and provides the basis for the data in a boxplot.
## Series
A series is a one-dimensional array that can hold any type of data. We'll generally use the term to refer to a column in a dataset that's arranged in a table. In fact, this is the reason why, in the library pandas, DataFrame columns are called Series. In a pandas Series, all items in the array must be the same data type.
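For example, a small Series can be built directly from a list; pandas stores all of its items with a single data type (the values here are arbitrary):

```python
import pandas as pd

# All items share one data type; pandas infers float64 here
areas = pd.Series([82.0, 187.0, 235.0], name="area_m2")
print(areas.dtype)
```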
## The Mean
All the data points in a dataset can be added together and then divided by the total number of data points to find the mean. You might be used to calling this number an average, but the two ideas are the same. Means help us understand the central tendency of a dataset.
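As a quick sanity check (with made-up numbers), the `mean` method gives the same result as summing the values and dividing by the count:

```python
import pandas as pd

# Mean = sum of the data points divided by how many there are
areas = pd.Series([82.0, 187.0, 235.0, 112.0])
print(areas.mean())  # same as areas.sum() / len(areas)
```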
## Skewed Distributions
For any given activity, there is a range of probable outcomes. All other things being equal, we would expect most of the outcomes to fall in the middle of the possible range, with the number of outcomes diminishing on either side of the peak. In statistics, this is known as a normal distribution, but you may have heard it called a bell curve, because it looks like a bell. Here's an example:
A **skewed distribution** is a type of distribution where the peak of the curve is shifted, or *skewed*, either to the right or the left of the distribution. Here's an example:
## Variance
**Variance** (which is sometimes called **volatility** in finance) is a measurement of how spread out the points in a dataset are around the mean. It's calculated by first summing up the squared difference between each data point and the mean, then dividing by the number of data points minus 1.
Here's a great video that shows how to calculate variance:
```python
from IPython.display import YouTubeVideo

YouTubeVideo("deIQeQzPK08", width=600)
```

We can get the variance for any numerical column in a DataFrame by using the `.var()` method:
```python
import pandas as pd

df = pd.read_csv("data/colombia-real-estate-1.csv")
df.head()
```

|   | property_type | department | lat | lon | area_m2 | price_usd |
|---|---|---|---|---|---|---|
| 0 | house | Bogotá D.C | 4.690 | -74.048 | 187.0 | $330,899.98 |
| 1 | house | Bogotá D.C | 4.695 | -74.082 | 82.0 | $121,555.09 |
| 2 | house | Quindío | 4.535 | -75.676 | 235.0 | $219,474.47 |
| 3 | house | Bogotá D.C | 4.620 | -74.129 | 195.0 | $97,919.38 |
| 4 | house | Atlántico | 11.012 | -74.834 | 112.0 | $115,477.34 |
```python
df["area_m2"].var()
```

    8057.128914931936
<font size="+1">Practice</font>
Get the variance of the `lat` and `lon` columns.
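A sketch of one possible solution, using hypothetical coordinates in place of the real `lat` and `lon` columns:

```python
import pandas as pd

# Hypothetical stand-in for df[["lat", "lon"]]
df_demo = pd.DataFrame(
    {"lat": [4.69, 4.70, 11.01], "lon": [-74.05, -74.08, -74.83]}
)

# Sample variance of each column with .var()
print(df_demo["lat"].var())
print(df_demo["lon"].var())
```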
Because variance is the squared deviation from the mean, it's heavily influenced by outliers. When the outliers are too far away from the mean, the variance might not reveal the true picture of how the data points are distributed. In this case, we can calculate the **trimmed variance** instead. The trimmed variance is the variance calculated after excluding the largest and smallest data points. Using the `trimmed_var` function in the SciPy library, we can calculate the trimmed variance for a column.
```python
from scipy import stats

stats.mstats.trimmed_var(df["area_m2"])
```

    3822.0997717413616
Note the trimmed variance is much smaller than the variance.
<font size="+1">Practice</font>
Get the trimmed variance for `lat`.
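A sketch showing the idea on synthetic data (the numbers are invented): with SciPy's default limits, `trimmed_var` excludes the most extreme observations, so with a strong outlier present the trimmed variance comes out smaller than the full variance:

```python
import numpy as np
from scipy import stats

# Hypothetical latitudes with a single extreme outlier at the end
lat = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0])

trimmed = stats.mstats.trimmed_var(lat)  # default: trims 10% from each tail
full = lat.var(ddof=1)
print(full, trimmed)
```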
## Standard Deviation
Standard deviation measures how spread out the values in a dataset are around the mean; it is the square root of the variance. In a normal distribution, 68% of the values fall within one standard deviation of the mean, 95% of the values fall within two standard deviations of the mean, and 99.7% of the values fall within three standard deviations of the mean.
Note that, in finance, standard deviation is often called **volatility**.
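A quick illustration with arbitrary numbers: since the standard deviation is the square root of the variance, `.std()` squared matches `.var()`:

```python
import pandas as pd

# Standard deviation via .std(); note std**2 equals the variance
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(s.std())
```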
## Outliers
An outlier is a value in a dataset that falls well beyond the dataset mean, typically more than three standard deviations away. Depending on the analytical strategy, it might be useful to drop outliers from a dataset, because their extreme deviation from the mean can result in misleading conclusions.
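A minimal sketch of flagging values more than three standard deviations from the mean, using invented data with one extreme observation:

```python
import pandas as pd

# Mostly typical values plus one extreme observation
areas = pd.Series([100.0] * 20 + [105.0] * 20 + [10000.0])

# Keep only values more than three standard deviations from the mean
mean, std = areas.mean(), areas.std()
outliers = areas[(areas - mean).abs() > 3 * std]
print(outliers)
```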
xxxxxxxxxx ## Categorical DataCategorical Data¶
Categorical data is any type of data that can only be represented by distinct values. Eye color, handedness, and academic attainment are all categorical variables. The other kind of variable is called a continuous variable. Continuous variables can have an infinite number of values, whereas categorical variables have concrete values. For this reason, categorical values require special attention in statistical analysis.
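For instance, counting the distinct values of a categorical Series (the eye-color values below are made up) shows how categorical data takes only a fixed set of values:

```python
import pandas as pd

# A categorical variable: only a handful of distinct values appear
eye_color = pd.Series(["brown", "blue", "brown", "green", "brown"])
print(eye_color.value_counts())
```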
## Location Data
Location data is information about a datapoint’s location in space, and can be expressed in latitude/longitude pairs, street address, altitude, or any other place-specific identifiers.
## Numerical Data
Numerical data is any information that can be represented by numbers.
# Summary Statistics
## Summary Statistics
Summary statistics are a set of simple calculations that help data scientists understand the broad strokes of their datasets.
## Working with Summary Statistics
To calculate summary statistics in pandas, use the `describe` method. We can generate summary statistics for the colombia-real-estate-1 dataset with code that looks like this:
```python
import pandas as pd

df1 = pd.read_csv("data/colombia-real-estate-1.csv")
df1.describe()
```

|   | lat | lon | area_m2 |
|---|---|---|---|
| count | 2967.000000 | 2967.000000 | 3066.000000 |
| mean | 5.938198 | -74.512216 | 171.249511 |
| std | 2.638133 | 2.510621 | 89.761511 |
| min | -0.001299 | -81.385000 | 64.000000 |
| 25% | 4.671000 | -74.823000 | 100.000000 |
| 50% | 4.717000 | -74.078000 | 145.000000 |
| 75% | 5.070000 | -74.049000 | 222.000000 |
| max | 13.333000 | 0.001431 | 449.000000 |
By default, the `describe` method returns the count, mean, standard deviation, minimum, quartiles, and maximum for each column. Also by default, it ignores non-numerical columns.
<font size="+1">Practice</font>
Try it yourself! Using the colombia-real-estate-2 dataset, create a DataFrame called `df2`, and print the resulting summary statistics.
```python
df2 = pd.read_csv("data/colombia-real-estate-2.csv")
```

# Calculate the Quantiles for a Series
Quantiles allow you to summarize the distribution of numerical values in a series. The $n$-quantiles divide an ordered series into $n$ portions, each with the same number of entries; the boundaries between these portions are known as quantiles. Let's load a dataset to see how this works in practice:
```python
mexico_city2 = pd.read_csv("./data/mexico-city-real-estate-2.csv")
mexico_city2.head()
```

To examine quantiles, let's pick the `price` column:
```python
price = mexico_city2["price"]
price
```

    0       4300500.0
    1       2788000.0
    2       3351600.0
    3       2862800.0
    4       3204800.0
              ...
    3786    1893000.0
    3787    1425559.0
    3788    1774000.0
    3789     841000.0
    3790    1295000.0
    Name: price, Length: 3791, dtype: float64

The median is the middle entry in the ordered list of prices:
```python
price.quantile(0.5)
```

    2273333.0
## Quartiles
A commonly used set of quantiles are the 4-quantiles, known as quartiles. You can also find the minimum, first quartile, median, third quartile, and maximum values in a series (which are typically the values used to create a boxplot):
```python
price.quantile([0, 0.25, 0.5, 0.75, 1])
```

    0.00      200000.0
    0.25     1180000.0
    0.50     2273333.0
    0.75     4500000.0
    1.00    78000000.0
    Name: price, dtype: float64
<font size="+1">Practice</font>

What's the 0.7 quantile in the `price` column of `mexico-city-real-estate-3.csv`?
```python
mexico_city3 = ...
price = ...
print(price.quantile(...))
```
# Correlations
**Correlations** tell us about the relationship between two sets of data. When we calculate this relationship, the result is a **correlation coefficient**. Correlation coefficients can have any value between -1 and 1. Values above 0 indicate a positive relationship (as one variable goes up, the other does too), and values below 0 indicate a negative relationship (as one variable goes up, the other goes down). The closer the coefficient's value is to either 1 or -1, the stronger the relationship is; the closer the coefficient's value is to 0, the weaker the relationship is. Coefficients equal to 0 indicate that there is no relationship between the two values, and are accordingly quite rare.
Let's run a correlation on some of the data from the colombia-real-estate-2 dataset. We might suspect that there is some kind of relationship between the price of a property and the area it occupies, so we'll use the `Series.corr` method to figure it out. The code looks like this:
```python
area_m2 = df2["area_m2"]
price_cop = df2["price_cop"]
correlation = area_m2.corr(price_cop)
print(correlation)
```

The correlation coefficient here is about 0.519, which is a moderate, positive correlation. That is, as the area of a property goes up, so does the price. If the result had been a negative number, we would be able to say that as the area goes up, the price goes down.
<font size="+1">Practice</font>
Try it yourself! Find the relationship between `"area_m2"` and `"price_usd"` in the colombia-real-estate-3 dataset, and interpret the resulting coefficient.
```python
df3 = ...
print(correlation)
```

---
<font size="+3"><strong>Visualizing Data: Matplotlib</strong></font>
There are many ways to interact with data, and one of the most powerful modes of interaction is through **visualizations**. Visualizations show data graphically, and are useful for exploring, analyzing, and presenting datasets. We use four libraries for making visualizations: [pandas](../%40textbook/07-visualization-pandas.ipynb), Matplotlib, [plotly express](../%40textbook/08-visualization-plotly.ipynb), and [seaborn](../%40textbook/09-visualization-seaborn.ipynb). In this section, we'll focus on using Matplotlib.
# Boxplots
A boxplot is a graph that shows the minimum, first quartile, median, third quartile, and maximum values in a dataset. Boxplots are useful because they provide a visual summary of the data, enabling researchers to quickly identify the median, the dispersion of the dataset, and signs of skewness.
Let's create a boxplot using the "area_m2" data from colombia-real-estate-1.
Note that the `usecols` argument allows us to read in only the columns we want, in this case "area_m2".
```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("data/colombia-real-estate-1.csv", usecols=["area_m2"])
plt.boxplot(df["area_m2"])
plt.ylabel("Area [sq. meters]")
plt.title("Area in Square Meters");
```

Here's how to interpret a boxplot. A boxplot always contains a box with two lines above and below it. Those lines are called **whiskers**, and the ends of those lines represent the upper and lower bounds of the dataset. There are also some values above the top whisker, but we'll come back to them a little later. In the meantime, notice that the plot divides the data into four sections:
- the bottom of the lower whisker to the bottom of the box,
- the bottom of the box to the line in the middle,
- the middle line to the top line of the box, and
- the top of the box to the end of the upper whisker.
These sections are called intervals, and the three lines that divide them are called quartiles. Each interval contains 25% of the observations in the dataset, which means that the box created by the first and third quartiles represents the middle 50% of observations. The second quartile — the orange line on our graph — represents the dataset's median value.
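The quartiles a boxplot draws can also be computed directly with pandas. Here's a minimal sketch using a made-up Series of property areas (the numbers are illustrative, not from the dataset):

```python
import pandas as pd

# Hypothetical property areas in square meters (illustrative values only)
areas = pd.Series([40, 55, 60, 72, 80, 95, 110, 150, 300])

# The three quartiles that divide the data into four equal-sized intervals
quartiles = areas.quantile([0.25, 0.50, 0.75])
print(quartiles)

# The second quartile is the same thing as the median
print(areas.median())
```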
Keeping all that in mind, the boxplot here shows that while there are some very large properties in our Colombia dataset, half the properties tend to be much smaller. Remember those data points above the top whisker? Those are called outliers, and they represent values so extreme that they fall outside the meaningful spread of observations in the dataset. Since half of our properties are small — one-bedroom apartments, say — we don't really need to pay attention to eighty-room mansions nestled in the middle of thousand-acre estates. Not for this project anyway. Part of the purpose of making a boxplot is to find those outliers and discard them from future analyses.
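By default, Matplotlib places the whiskers 1.5 times the interquartile range (IQR) beyond the box, and any point outside that range is drawn as an outlier. A sketch of that rule on made-up numbers:

```python
import pandas as pd

# Illustrative data: one extreme 900 m2 "mansion" among ordinary properties
areas = pd.Series([40, 55, 60, 72, 80, 95, 110, 150, 900])

q1, q3 = areas.quantile(0.25), areas.quantile(0.75)
iqr = q3 - q1

# Matplotlib's default whisker rule: 1.5 * IQR beyond each quartile
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences would be plotted as an outlier point
outliers = areas[(areas < lower_fence) | (areas > upper_fence)]
print(outliers)
```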
Practice
Try it yourself! Create a boxplot using the price_cop data from colombia-real-estate-2.
```python
df2 = ...
```

# Histograms
A histogram is a graph that shows the frequency distribution of numerical data. In addition to helping us understand frequency, histograms are also useful for detecting outliers.
Let's create a histogram using the "area_m2" data from colombia-real-estate-1.
```python
df = pd.read_csv("data/colombia-real-estate-1.csv", usecols=["area_m2"])
plt.hist(df, bins=10, rwidth=0.9, color="b")
plt.title("The Area of Real Estate in Colombia")
plt.xlabel("Property Area")
plt.ylabel("Number of Properties")
plt.grid(axis="y", alpha=0.75);
```

There are two things to consider in this histogram. First and foremost is its interpretation: the distribution is strongly right-skewed, with most of the observations bunched at the left. In this case, the data suggest that the majority of properties in Colombia are smaller than 150 square meters.
You might also have noticed that there are ten bars. In a histogram, we call these bars bins. A bin is simply a way to group data to make it easier to see trends. You can use as many or as few as you like; just recognize that the fewer bins you use, the less detailed the output will become.
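If you want to see the effect of the bin count without drawing anything, `numpy.histogram` returns the same counts that Matplotlib plots. A quick sketch on synthetic, right-skewed data (not the real-estate dataset):

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic, right-skewed "property areas" for illustration only
areas = rng.exponential(scale=100, size=1000)

# Fewer bins -> coarser summary; more bins -> more detail
counts_5, edges_5 = np.histogram(areas, bins=5)
counts_20, edges_20 = np.histogram(areas, bins=20)

print(counts_5)   # five coarse counts
print(counts_20)  # twenty finer-grained counts
```

Either way, every observation lands in exactly one bin, so the counts always sum to the number of data points.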
Let's see what the same data looks like with 20 bins instead of 10.
```python
df = pd.read_csv("data/colombia-real-estate-1.csv", usecols=["area_m2"])
# Notice that we changed `bins` from 10 to 20
plt.hist(df, bins=20, rwidth=0.9, color="b")
plt.title("The Area of Real Estate in Colombia")
plt.xlabel("Property Area")
plt.ylabel("Number of Properties")
plt.grid(axis="y", alpha=0.75);
```

The outline isn't quite as smooth as it was before. Depending on the story you want your histogram to tell, it might be more important for a trend to be smooth, or it might be more important to show smaller levels of variation. Here, we can still see that the properties in Colombia tend to be small, but the sizes of the properties are not evenly distributed. This histogram doesn't tell us anything about *why* this might be true, but it does suggest that there might be something more interesting going on in the background.
Practice
Try it yourself! Create two histograms using the price_cop data from colombia-real-estate-2: one with five bins, and the other with 15 bins.
```python
df2 = ...

# First histogram (5 bins)
plt.title("The Area of Real Estate in Colombia")
plt.xlabel("Property Area")
plt.ylabel("Number of Properties")
plt.grid(axis="y", alpha=0.75)

# Second histogram (15 bins)
plt.title("The Area of Real Estate in Colombia")
plt.xlabel("Property Area")
plt.ylabel("Number of Properties")
plt.grid(axis="y", alpha=0.75)
```

# Bar Charts
A bar chart is a graph that shows all the values of a categorical variable in a dataset. They consist of an axis and a series of labeled horizontal or vertical bars. The bars depict frequencies of different values of a variable or simply the different values themselves. The numbers on the y-axis of a vertical bar chart or the x-axis of a horizontal bar chart are called the scale.
Let's make a bar chart showing the number of properties in each borough, using the mexico-city-real-estate-1 dataset. First, let's pull out the values in the borough variable.
```python
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")
new_columns = ["empty1", "country", "city", "borough", "empty2"]
# Passing `n` as a keyword avoids a pandas FutureWarning about positional arguments
mexico_city1[new_columns] = mexico_city1["place_with_parent_names"].str.split(
    "|", n=4, expand=True
)
mexico_city1 = mexico_city1.drop(["empty1", "empty2"], axis=1)
by_borough = mexico_city1["borough"].value_counts()
by_borough
```

```
Miguel Hidalgo            757
Benito Juárez             444
Álvaro Obregón            420
Cuajimalpa de Morelos     417
Iztapalapa                299
Tlalpan                   295
Cuauhtémoc                256
Gustavo A. Madero         197
Coyoacán                  159
Tláhuac                   152
Venustiano Carranza       123
La Magdalena Contreras     73
Xochimilco                 70
Azcapotzalco               62
Iztacalco                  61
                            5
Milpa Alta                  1
Name: borough, dtype: int64
```

Now, let's make a bar chart of the Series.
```python
plt.bar(x=by_borough.index, height=by_borough.values)
plt.ylabel("Number of Properties")
plt.xticks(rotation=90);
```

Practice
Try it yourself! Make a bar chart showing the number of properties of each property type in the mexico-city-real-estate-1.csv dataset.
```python
by_property_type = ...
plt.ylabel("Number of Properties")
plt.xticks(rotation=90);
```

# Scatter Plots
A scatter plot is a graph that uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables, and are especially useful if you're looking for correlations.
Let's create a simple scatter plot using the "area_m2" and "price_usd" data from colombia-real-estate-3. Note that we use the `usecols` argument below to import only those two columns from the CSV file.
```python
df1 = pd.read_csv("data/colombia-real-estate-3.csv", usecols=["area_m2", "price_usd"])
plt.scatter(df1["area_m2"], df1["price_usd"], color="r")
plt.xlabel("Property Area")
plt.ylabel("Price in US Dollars")
plt.title("Property Area vs Price in US Dollars");
```

This scatter plot is showing us what we might already have suspected: there are lots of cheap, small properties, and as the size of the property goes up, so does the price.
Practice
Try it yourself! Create a scatter plot using the area_m2 and price_cop columns in the colombia-real-estate-2 dataset. Try changing the color argument to "b", "g", or "y", and see what happens!
```python
df2 = pd.read_csv(
    "data/colombia-real-estate-2.csv", usecols=["area_m2", "price_cop"]
)
plt.xlabel("Property Area")
plt.ylabel("Price in Colombian Pesos")
plt.title("Property Area vs Price in Colombian Pesos");
```

You may have noticed that there are lots of data points here, which makes it difficult to see if there's really a trend going on. This issue is called **over-plotting**, and it's very common in large datasets.
One solution to over-plotting is to use the `df.sample` method to select a random sample from the dataset. In the example below, note that `frac=0.50` indicates that the random sample will consist of 50% of the data points in the set.
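As a quick aside, here's how `df.sample` behaves on a toy DataFrame (the data is made up): `frac` sets the fraction of rows to draw, and `random_state` makes the draw reproducible.

```python
import pandas as pd

# A tiny stand-in DataFrame with 100 rows (illustrative values only)
df = pd.DataFrame({"area_m2": range(100), "price_usd": range(100)})

# 50% sample; random_state pins the random draw so it's repeatable
half = df.sample(frac=0.50, random_state=1)
print(len(half))  # 50 rows

# The same seed always selects the same rows
print(half.equals(df.sample(frac=0.50, random_state=1)))  # True
```

Passing `replace=True`, as the example below does, samples with replacement, so the same row can appear more than once in the sample.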
```python
df3 = pd.read_csv("data/colombia-real-estate-3.csv", usecols=["area_m2", "price_usd"])
df4 = df3.sample(frac=0.50, replace=True, random_state=1)
plt.scatter(df4["area_m2"], df4["price_usd"], color="r")
plt.xlabel("Property Area")
plt.ylabel("Price in US Dollars")
plt.title("Property Area vs Price in US Dollars");
```

That looks better, but things are still a little muddled. Let's try a random sample that only looks at 25% of the data.
```python
df4 = df3.sample(frac=0.25, replace=True, random_state=1)
plt.scatter(df4["area_m2"], df4["price_usd"], color="r")
plt.xlabel("Property Area")
plt.ylabel("Price in US Dollars")
plt.title("Property Area vs Price in US Dollars");
```

Perfect! With a smaller sample, it's much easier to see what's going on.
Practice
Try it yourself! Using a random sample of 25% of the area_m2 and price_cop data from the colombia-real-estate-3 data, create a green scatter plot.
```python
df3 = pd.read_csv("data/colombia-real-estate-3.csv", usecols=["area_m2", "price_usd"])
df4 = ...
```

With the data as broadly scattered as this, it's unlikely that the two variables share a strong positive correlation, but let's calculate the **correlation coefficient**, just to be sure.
To do that, we'll go back to the full dataset and use the `corr` method. The code looks like this:
```python
df = pd.read_csv("data/colombia-real-estate-3.csv", usecols=["area_m2", "price_usd"])
area_m2 = df["area_m2"]
price_usd = df["price_usd"]
correlation = area_m2.corr(price_usd)
print(correlation)
```

```
0.5250881127255975
```
This indicates a moderate positive correlation between `"area_m2"` and `"price_usd"`, which is consistent with the scatter plot we made.
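By default, `Series.corr` computes the Pearson correlation coefficient, which is the covariance of the two variables divided by the product of their standard deviations. A quick check of that definition on made-up numbers:

```python
import pandas as pd

# Illustrative values only, not from the dataset
x = pd.Series([50, 80, 120, 200, 300])
y = pd.Series([30, 60, 100, 150, 400])

# Pearson correlation via .corr()
r = x.corr(y)

# The same quantity computed from its definition
r_manual = x.cov(y) / (x.std() * y.std())

print(round(r, 6) == round(r_manual, 6))  # True
```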
Practice
Try it yourself! Use the corr method to find the correlation coefficient of "area_m2" and "price_cop" in the colombia-real-estate-3 dataset.
```python
import pandas as pd

df = pd.read_csv("data/colombia-real-estate-3.csv", usecols=["area_m2", "price_usd"])
area_m2 = df["area_m2"]
price_usd = df["price_usd"]
correlation = ...
print(correlation)
```
# Add a vertical or horizontal line across a plot
Vertical lines can be added to an existing plot in Matplotlib using `plt.axvline`. For this example, we'll add a vertical line using the scatter plot created in the previous step:
```python
plt.scatter(df4["area_m2"], df4["price_usd"], color="r")
plt.axvline(250, linestyle="--", color="blue", label="Line Title")
plt.xlabel("Property Area")
plt.ylabel("Price in US Dollars")
plt.title("Property Area vs Price in US Dollars");
```
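Horizontal lines work the same way with `plt.axhline`, which takes a y-value instead of an x-value. A sketch with made-up points standing in for the property data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this sketch also runs outside a notebook
import matplotlib.pyplot as plt

# Hypothetical area/price pairs, for illustration only
x = [50, 100, 150, 200, 250]
y = [40_000, 90_000, 120_000, 210_000, 260_000]

plt.scatter(x, y, color="r")
plt.axvline(150, linestyle="--", color="blue")      # vertical line at x = 150
plt.axhline(100_000, linestyle=":", color="green")  # horizontal line at y = 100,000
plt.xlabel("Property Area")
plt.ylabel("Price in US Dollars");
```

Both functions also accept the same `linestyle`, `color`, and `label` keyword arguments used elsewhere in this section.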
Practice
Try it yourself! Use the axvline function to plot a vertical line at 350 and at 150 on the x-axis using the same scatter plot from the previous example.
```python
plt.scatter(df4["area_m2"], df4["price_usd"], color="r")
plt.axvline(150, linestyle="--", color="blue", label="Line Title")
plt.axvline(350, linestyle="--", color="blue", label="Line Title")
plt.xlabel("Property Area")
plt.ylabel("Price in US Dollars")
plt.title("Property Area vs Price in US Dollars");
```

# Sources & Further Reading
- Example Boxplot from the Matplotlib Documentation
- Discussion of How to Interpret a Histogram
- Python Documentation on Histograms
- Pandas Official DataFrame Bar Plot Documentation
- Online Tutorial on Using Groupby and Count in Pandas
- Pandas Official Documentation on Splitting a String Entry in a Column
- Wikipedia Entry on Boroughs of Mexico City
- Stack Overflow Entry on Adding Labels and Titles to Pandas Plots
- Further Information on Scatter Plots
- Scatter Plot Tutorial
- Subsetting Random Samples in a DataFrame
- Pandas `corr` Documentation
---

Copyright © 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.